Normalization Based K Means Clustering Algorithm

The document proposes a normalization-based K-means clustering algorithm called N-K means. It summarizes previous work that applied techniques like normalization and improved initialization of centroids to enhance K-means clustering. The proposed N-K means algorithm pre-processes and normalizes the data before applying K-means clustering. It calculates initial centroids based on weighted averages of the dataset attributes. Experimental results show N-K means performs better than traditional K-means in terms of complexity and performance.

Normalization based K means Clustering Algorithm

Deepali Virmani¹, Shweta Taneja², Geetika Malhotra³

¹ Department of Computer Science, Bhagwan Parshuram Institute of Technology, New Delhi
Email: deepalivirmani@gmail.com
² Department of Computer Science, Bhagwan Parshuram Institute of Technology, New Delhi
Email: shweta_taneja08@yahoo.co.in
³ Department of Computer Science, Bhagwan Parshuram Institute of Technology, New Delhi
Email: geets002@gmail.com

Abstract- K-means is an effective clustering technique used to separate similar data into groups based on initial centroids of clusters. In this paper, a Normalization based K-means clustering algorithm (N-K means) is proposed. The proposed N-K means clustering algorithm applies normalization to the available data prior to clustering and calculates initial centroids based on weights. Experimental results show that the proposed N-K means clustering algorithm improves on the existing K-means clustering algorithm in terms of complexity and overall performance.

Keywords- Clustering, Data mining, K means, Normalization, Weighted Average

I. INTRODUCTION

Data mining [7][11], or knowledge discovery, is a process of analysing large amounts of data and extracting useful information. It is an important technology which is used by industries as a novel approach to mine data. Data mining tools and techniques are used to generate effective results which were earlier difficult and time consuming to obtain. Data mining is widely used in various areas like financial data analysis, the retail and telecommunication industries, biological data analysis, fraud detection, spatial data analysis and other scientific applications.

Clustering is a technique of data mining in which similar objects are grouped into clusters. Clustering techniques are widely used in various domains like information retrieval, image processing, etc. [1][2]. There are two types of approaches in clustering: hierarchical and partitioning. In hierarchical clustering, the clusters are combined based on their proximity, or how close they are. This combination is stopped when further processing would lead to undesirable clusters. In the partition clustering approach, one dataset is separated into a definite number of small sets in a single iteration [10]. The accuracy and quality of clustering results depend on how the algorithms are implemented and on their ability to find hidden knowledge.

There are various clustering algorithms based on the nature of the generated clusters and the techniques used. Some of them are BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), CURE (Clustering Using Representatives), K-means, genetic K-means, CLARA, DBSCAN, CLARANS, etc. [6]. The most widely used clustering algorithm is the K-means algorithm. This algorithm is used in many practical applications. It works by selecting the initial number of clusters and initial centroids [7][13]. We have chosen the K-means algorithm over other clustering algorithms as it is very efficient in processing large data sets. It often terminates at a local optimum and generates tighter clusters than hierarchical clustering, especially if clusters are globular. It is a popular algorithm because of its observable speed and simplicity. But K-means has a major disadvantage: it does not work well with clusters of different size and different density. Moreover, initial centroids are chosen randomly, due to which the clusters produced vary from one run to another. Also, there exist sets of data points on which K-means takes superpolynomial time [8][5].

Different researchers have put forward various methods to improve the efficiency and running time of the K-means algorithm. K-means uses the concept of Euclidean distance to calculate the centroids of the clusters. This method is less effective when new data sets are added that have no effect on the measured distance between various data objects. The computational complexity of the K-means algorithm is also very high [1][9]. Also, K-means is unable to handle noisy data and missing values. Data preprocessing techniques are often applied to the datasets to make them cleaner, more consistent and noise free. Normalization is used to eliminate redundant data and to ensure that good quality clusters are generated, which can improve the efficiency of clustering algorithms. So it becomes an essential step before clustering, as Euclidean distance is very sensitive to changes in the differences [3].

This paper is organized as follows: Section II presents a literature survey covering the work done by various authors to improve the K-means clustering algorithm. Section III describes our proposed N-K means algorithm stepwise, followed by experimental results in Section IV, where a comparison is shown between the traditional K-means clustering algorithm and the N-K means clustering algorithm. Lastly, the conclusions are presented in Section V.

II. LITERATURE SURVEY

A lot of methods and techniques have been proposed over the past few years to improve the accuracy of the K-means algorithm, and there is a need to optimize it to obtain good results. This section discusses the various approaches proposed by researchers to find better initial centroids in the K-means algorithm.

The authors of [3] have proposed data preprocessing techniques like cleaning and normalization to produce optimum quality clusters. In normalization, the data to be analyzed is scaled to a specific range. A modified K-means algorithm is proposed which provides a solution for automatic initialization of centroids, and performance is enhanced using normalization. This technique overcomes many drawbacks of the naive K-means algorithm.

Some authors have proposed their own methods to identify initial centroids. In [1], the authors proposed a novel method to find better initial centroids as well as more accurate clusters with less computational time. This method finds a weighted average score of the dataset by averaging the attribute values of each data point to generate initial centroids.

In another work, the authors of [8] proposed a new K-means clustering method with improved initial centres. In this method, initial cluster centres are selected and the centres are used as input to K-means. The user is not required to give the number of clusters as input.

In another study [4], a data clustering approach is proposed which works by partitioning the space into different segments and calculating the frequency of data points in each segment; the segment which shows the maximum frequency of data points has the maximum chance of containing the centroid of a cluster. The authors have introduced the concept of a threshold distance for each cluster's centroid for comparing the distance between a data point and the cluster's centroid; using this method, the effort needed to calculate the distance between a data point and a cluster's centroid is minimized. This algorithm effectively decreases the complexity and makes calculations easier.

III. PROPOSED N-K MEANS ALGORITHM

The K-means algorithm can generate better results after modification of the database. We apply the modified algorithm with calculation of initial centroids based on the weighted average score of the dataset. We preprocess and normalize the dataset before we apply the N-K means algorithm. The proposed method works in three stages. During the first stage, a data preprocessing technique is adopted that transforms raw data into an understandable format. During the second stage, normalization is performed to standardize the data objects into a specific range. During the third stage, we apply the N-K means algorithm to generate clusters.

3.1 DATA PRE-PROCESSING

It is a very important step and should be adopted in clustering, as this method uses concepts like constant, average, minimum, maximum and standard deviation to calculate missing values in the tuples [7]. Missing values need to be avoided for accurate results. Preprocessing involves steps like data cleaning, data integration, data transformation, data reduction and data discretization.

3.2 NORMALIZATION

Data mining can generate effective results if normalization is applied to the dataset. It is a process used to standardize all the attributes of the dataset and give them equal weight, so that redundant or noisy objects can be eliminated and there is valid and reliable data, which enhances the accuracy of the result. The K-means algorithm uses Euclidean distance, which is highly sensitive to irregularities in the scale of the various features [3]. There are various data normalization methods like Min-Max, Z-Score and Decimal Scaling. The best normalization method depends on the data to be normalized. Here, we have used the Min-Max normalization technique in our algorithm because our dataset is limited and does not have much variability between minimum and maximum. The Min-Max normalization technique performs a linear transformation on the data. In this method, we fit the data into a predefined boundary or interval.

3.3 INITIAL CENTROIDS CALCULATION

We use a uniform method to find a score by taking the average of the attributes of each data point, which will generate initial centroids that follow the data distribution of the given set. A sorting algorithm is applied to the scores of the data points, and the sorted data is then divided into k subsets, where k is the number of clusters. Finally, the value nearest to the mean of each subset is taken as an initial centroid. In this method we have introduced a weight for each attribute, which makes the method advantageous, as any feature of the dataset can be emphasized by increasing the weight related to that attribute.

The algorithm is given in Fig. 1:

ALGORITHM 1: Steps of N-K means Algorithm

INPUT: A dataset with d dimensions

OUTPUT: Clusters

1. Load initial data set.


2. Find the maximum and minimum values of each feature from the dataset.
3. Normalize the real scalar values of the dataset with the maximum and minimum values using the equation

   $$v' = \frac{v - \min(e)}{\max(e) - \min(e)} \qquad (1)$$

   where min(e) and max(e) are the minimum and the maximum values for attribute e.

4. Pass the number of clusters and generate initial centroids using algorithm 2.
5. Generate clusters.

Figure 1: Steps of the N-K means algorithm
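Steps 2 and 3 of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch of Equation (1), not the authors' implementation: the function name `min_max_normalize`, the sample values, and the choice to map a constant attribute to 0.0 are assumptions made here.

```python
def min_max_normalize(rows):
    """Rescale every attribute (column) of `rows` to [0, 1] with Equation (1)."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]   # min(e) for each attribute e
    maxs = [max(col) for col in columns]   # max(e) for each attribute e
    normalized = []
    for row in rows:
        new_row = []
        for v, lo, hi in zip(row, mins, maxs):
            # Assumption: Equation (1) is undefined when max(e) == min(e),
            # so a constant attribute is mapped to 0.0 here.
            new_row.append((v - lo) / (hi - lo) if hi > lo else 0.0)
        normalized.append(new_row)
    return normalized


# Hypothetical two-attribute sample, not taken from the paper.
sample = [[5.1, 3.5], [4.9, 3.0], [6.3, 3.3], [5.8, 2.7]]
normalized_sample = min_max_normalize(sample)
```

After this step every attribute spans the same interval, so no single attribute dominates the Euclidean distance merely because of its units.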

ALGORITHM 2: Initialization of centroids

1. Calculate the average score of each data point.

   1) $d_i = (x_1, x_2, x_3, \dots, x_m)$
   2) $d_{i(\mathrm{avg})} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_m x_m}{m}$, where x = attribute's value, m = number of attributes, w = weight applied to ensure a fair distribution of clusters.

2. Sort the data based on the average score.

3. Divide the data into k subsets.

4. Calculate the mean value of each subset.

5. For each subset, take the data point nearest to the mean as the initial centroid.

Figure 2: Algorithm to calculate the initial centroids
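The five steps of Algorithm 2 can be sketched as follows. The text leaves some details open, so this sketch makes assumptions that are not confirmed by the paper: the sorted data is split into k contiguous, near-equal subsets, the "mean value of each subset" is taken to be the mean of the scores, and "nearest" is measured on the score. The function name `initial_centroids` and the sample points are hypothetical.

```python
def initial_centroids(points, k, weights=None):
    """Pick k initial centroids from `points` following Algorithm 2."""
    n, m = len(points), len(points[0])
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    if weights is None:
        weights = [1.0] * m  # assumption: equal weights unless specified

    # Step 1: weighted average score of each data point.
    scored = [(sum(w * x for w, x in zip(weights, p)) / m, p) for p in points]
    # Step 2: sort the data by score.
    scored.sort(key=lambda pair: pair[0])

    centroids = []
    for j in range(k):
        # Step 3: j-th contiguous subset (assumption: near-equal sizes).
        subset = scored[j * n // k:(j + 1) * n // k]
        # Step 4: mean of the subset (assumption: mean of the scores).
        mean_score = sum(score for score, _ in subset) / len(subset)
        # Step 5: the data point whose score is nearest to that mean.
        _, nearest = min(subset, key=lambda pair: abs(pair[0] - mean_score))
        centroids.append(list(nearest))
    return centroids


# Hypothetical, already normalized-looking sample in shuffled order.
demo_points = [[11.0, 11.0], [1.0, 1.0], [3.0, 3.0],
               [12.0, 12.0], [2.0, 2.0], [10.0, 10.0]]
demo_centroids = initial_centroids(demo_points, k=2)
```

Because the procedure is deterministic, repeated runs on the same data give the same starting centroids, unlike random initialization.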


IV. EXPERIMENTAL RESULTS

Our experiment was conducted on the Iris data set [12] from the UCI Machine Learning Repository to evaluate the performance of the N-K means clustering algorithm. In this section, we present a comparative analysis of the traditional K-means clustering algorithm and the N-K means algorithm. Both algorithms are run for different values of k. From the comparisons we can see that the N-K means algorithm outperforms the traditional K-means algorithm in terms of two parameters, namely execution time and speed. Hence the algorithm runs faster, as it executes in fewer iterations and the complexity is reduced. The results are depicted in Table 1.
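The link between the starting centroids and the number of iterations can be illustrated with a small sketch of the standard K-means (Lloyd) loop that accepts its initial centroids as input and reports how many passes it needed. This is not the authors' experimental code: the function `kmeans`, the toy points and both sets of starting centroids are assumptions chosen for illustration, and an empty cluster simply keeps its previous centroid.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Run standard K-means from the given centroids.

    Returns (labels, centroids, iterations_used).
    """
    centroids = [list(c) for c in centroids]
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            # No assignment changed, so the clustering has converged.
            return labels, centroids, iteration
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids, max_iter


# Hypothetical toy data with two well-separated groups.
toy = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
spread_start = kmeans(toy, [[1.0, 1.0], [8.0, 8.0]])    # one centroid per group
crowded_start = kmeans(toy, [[1.0, 1.0], [1.5, 2.0]])   # both in the same group
```

On this toy data both starts reach the same clusters, but the start with one centroid per group needs fewer passes, which is the kind of effect the comparison in this section measures.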
Table 1: Performance comparison of the N-K means algorithm with the existing K-means algorithm

Value of k    Algorithm     Time taken (ms)    Speed
1             K-means       0.078              5.1
1             N-K means     0.065              3.5
3             K-means       0.094              6.2
3             N-K means     0.081              4.7
5             K-means       0.125              6.6
5             N-K means     0.103              5.0
7             K-means       0.134              7.2
7             N-K means     0.117              5.7
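To make the size of the timing difference easier to read, the relative reduction in time can be computed directly from the "Time taken" values in Table 1. The short sketch below only restates the table's numbers; the variable names are assumptions, and the "Speed" column is left out because its unit is not stated.

```python
# (K-means time, N-K means time) in ms for each k, copied from Table 1.
timings_ms = {
    1: (0.078, 0.065),
    3: (0.094, 0.081),
    5: (0.125, 0.103),
    7: (0.134, 0.117),
}

# Relative reduction in time of N-K means with respect to K-means, in percent.
reductions = {
    k: 100.0 * (kmeans_ms - nk_ms) / kmeans_ms
    for k, (kmeans_ms, nk_ms) in timings_ms.items()
}

for k, pct in reductions.items():
    print(f"k={k}: N-K means takes {pct:.1f}% less time")
```

For the four reported values of k, the reduction lies between roughly 13% and 18%.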

Figure 3: Comparison between K-means and N-K means on the basis of time (plot of time against number of clusters, k = 1, 3, 5, 7)

Figure 4: Comparison between K-means and N-K means on the basis of speed (plot of speed against number of clusters, k = 1, 3, 5, 7)

V. CONCLUSION

The K-means clustering algorithm is widely used for clustering huge data sets. But the traditional K-means algorithm does not always generate good quality results, as automatic initialization of centroids affects the final clusters. This paper presents an efficient algorithm in which we first preprocess our dataset using a normalization technique and then generate effective clusters. This is done by assigning weights to each attribute value to achieve standardization. In our experiments on the Iris data set, our algorithm performed better than the traditional K-means algorithm in terms of execution time and speed.

REFERENCES

[1] Md. Sohrab Mahmud, Md. Mostafizer Rahman, Md. Nasim Akhtar, "Improvement of K-means Clustering algorithm with better initial centroids based on weighted average", 7th International Conference on Electrical and Computer Engineering, 2012, pp. 647-650.
[2] Madhu Yedla, Srinivasa Rao Pathakota, T. M. Srinivasa, "Enhancing K-means Clustering Algorithm with Improved Initial Center", International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 1(2), 2010, pp. 121-125.
[3] Vaishali Rajeev Patel, Rupa G. Mehta, "Performance Analysis of MK-means Clustering Algorithm with Normalization Approach", World Congress on Information and Communication Technologies, 2011, pp. 974-979.
[4] Ran Vijay Singh, M. P. S. Bhatia, "Data Clustering With Modified K means Algorithm", IEEE International Conference on Recent Trends in Information Technology, ICRTIT 2011, pp. 717-721.
[5] David Arthur, Sergei Vassilvitskii, "How Slow is the k means Method?", Proceedings of the 22nd Symposium on Computational Geometry (SoCG), 2006, pp. 144-153.
[6] J. Han, M. Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann Publishers, San Diego, 2001.
[7] Vaishali R. Patel, Rupa G. Mehta, "Impact of Outlier Removal and Normalization Approach in Modified k-Means Clustering Algorithm", IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, No. 2, 2011, pp. 331-336.
[8] Zhang Chen, Xia Shixiong, "K-means Clustering Algorithm with improved Initial Center", Second International Workshop on Knowledge Discovery and Data Mining, 2009, pp. 790-792.
[9] K. A. Abdul Nazeer, M. P. Sebastian, "Improving the Accuracy and Efficiency of the k-means Clustering Algorithm", Proceedings of the World Congress on Engineering 2009, Vol. I, WCE 2009, pp. 308-312.
[10] Margaret H. Dunham, "Data Mining- Introductory and Advanced Concepts", Pearson Education, 2006.
[11] R. Agrawal, T. Imielinski, A. Swami, "Mining association rules between sets of items in large database", The ACM SIGMOD Conference, Washington DC, USA, 1993, pp. 207-216.
[12] UCI Repository of Machine Learning Databases, Available: archive.ics.uci.edu/ml/
[13] J. MacQueen, "Some methods for classification and analysis of multivariate observations", Proc. 5th Berkeley Symp. Math. Statist. Prob., Vol. 1, 1967, pp. 281-297.
