
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR)
ISSN 2249-6831, Vol. 3, Issue 1, Mar 2013, 39-48
© TJPRC Pvt. Ltd.

EFFICIENT CLUSTERING OF DATASET BASED ON PARTICLE SWARM OPTIMIZATION


SURESH CHANDRA SATAPATHY¹ & ANIMA NAIK²

¹Sr. Member IEEE, ANITS, Visakhapatnam, Andhra Pradesh, India

²Majhighariani Institute of Technology and Sciences, Rayagada, Odisha, India

ABSTRACT
The Automatic Particle Swarm Optimization (AUTO-PSO) clustering algorithm can generate more compact clustering results than the traditional K-means clustering algorithm. However, when clustering high dimensional datasets, the AUTO-PSO clustering algorithm is notoriously slow because its computation cost increases exponentially with the dimension of the dataset. Dimensionality reduction techniques offer solutions that both significantly improve the computation time and yield reasonably accurate clustering results in high dimensional data analysis. In this paper, we present an algorithm that combines the weighted principal component (PC) dimensionality reduction technique with AUTO-PSO clustering, with the intention of reducing the complexity of datasets and speeding up the AUTO-PSO clustering process. We report significant improvements in total runtime. Moreover, the clustering accuracy of the dimensionality reduction AUTO-PSO clustering algorithm is comparable to that obtained using the full dimension space.

KEYWORDS: Clustering, Particle Swarm Optimization, Principal Component, Dimension Reduction

INTRODUCTION
Clustering of high dimensional data sets finds applications in many areas. Because traditional data clustering algorithms tend to be biased towards local optima when applied to high dimensional data sets, Particle Swarm Optimization (PSO) has been used for solving data clustering problems in recent years (D. W. van der Merwe et al., 2003; M. G. H. Omran et al., 2006; X. Cui et al., 2005; Ching-Yi Chen et al., 2006). Many researchers (D. W. van der Merwe et al., 2003; M. G. H. Omran et al., 2006; X. Cui et al., 2005) have indicated that, utilizing the PSO algorithm's optimization ability and given sufficient time, PSO generates a more compact clustering result from low dimensional data than the traditional K-means clustering algorithm, which needs prior knowledge of the data to be classified. PSO has emerged as one of the fast, robust, and efficient global search heuristics of current interest. PSO with automatic clustering of large unlabeled data sets requires no prior knowledge of the data to be classified; rather, it determines the optimal number of partitions of the data on the run, according to Ching-Yi Chen et al. (2006). However, when clustering high dimensional datasets, the PSO clustering algorithm is notoriously slow because the algorithm needs to repeatedly compute similarities between high dimensional data vectors. Researchers have found that dimensionality reduction techniques offer solutions that both improve the computation time and still yield accurate results in some high dimensional data analysis, according to C. Ding et al. (2002). It is therefore highly desirable to reduce the dimensionality of a high dimension dataset before clustering in order to maintain tractability. Since reducing dataset dimensionality may result in a loss of information, the lower dimension representation generated by a dimensionality reduction algorithm must be a good approximation of the full high dimensional dataset. In this regard, the weighted principal components (PCs) with a thresholding algorithm is a good choice, where the weighted PC is obtained as the weighted sum of the first k PCs of interest found by the PCA algorithm, according to Seoung Bum Kim et al. (2010). Each of the k loading values in the weighted PC reflects the contribution of an individual feature, and the thresholding algorithm identifies the significant features and reduces the dimension of the data set.


In this paper, we combine the PSO and weighted-PC dimensionality reduction techniques to obtain an efficient algorithm for clustering data. We use several real-life datasets, both low and high dimensional, as our experimental data. The rest of the paper is organized as follows: Section 2 introduces the basic concepts of the weighted-PC dimensionality reduction technique, the moving range-based thresholding algorithm, particle swarm optimization (PSO), and the automatic clustering PSO algorithm. The ACDRDPSO algorithm (automatic clustering on dimensionally reduced data using PSO) is discussed in Section 3. Detailed simulations and results are presented in Section 4. We conclude with a summary of the contributions of this paper in Section 5.

BASIC CONCEPTS
Dimensionality Reduction Technique: Weighted PCs

PCA is one of the most widely used multivariate data analysis techniques and is employed primarily for dimensionality reduction and visualization, according to Jolliffe (2002). PCA extracts a lower dimensional feature set that can explain most of the variability within the original data. The extracted features, the PCs Y_i, are each a linear combination of the original features with loading values \alpha_{ij} (i, j = 1, 2, ..., p). The Y_i can be represented as follows:

Y_i = \alpha_{i1} X_1 + \alpha_{i2} X_2 + \cdots + \alpha_{ip} X_p,  i = 1, 2, \ldots, p   (1)

The loading values represent the importance of each feature in the formation of a PC. For example, \alpha_{ij} indicates the degree of importance of the jth feature in the ith PC. A two-dimensional loading plot (e.g., a PC1 vs. PC2 loading plot) may provide a graphical display for identifying important features in the first and second PC domains. However, the interpretation of a two-dimensional loading plot is frequently subjective, particularly in the presence of a large number of features. Moreover, in some situations, consideration of only the first few PCs may be insufficient to account for most of the variability in the data. Determination of the appropriate number of PCs to retain (= k) can be subjective; one can use a scree plot that visualizes the proportion of variability explained by each PC. If a loading value for the jth original feature can be computed from each of the first k PCs, the importance of the jth feature can be represented as follows:

w(j) = \sum_{i=1}^{k} \theta_i |\alpha_{ij}|,  j = 1, 2, \ldots, p   (2)

where k is the number of PCs of interest and \theta_i represents the weight of the ith PC. The typical way to determine \theta_i is to compute the proportion of the total variance explained by the ith PC. w(j) can be called the weighted PC loading for feature j; a feature with a large value of w(j) is a significant feature. In the next section, we present a systematic way to obtain a threshold that determines the significance of each feature.
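As a concrete reading of equations (1) and (2), here is a minimal Python sketch, assuming a data matrix X of shape (samples, features); the function name and the use of SVD to obtain the loadings are our own illustration, not the paper's code (the paper's experiments were implemented in MATLAB):

```python
import numpy as np

def weighted_pc_loadings(X, k):
    """Weighted PC loadings w(j) from the first k PCs (eq. 2)."""
    Xc = X - X.mean(axis=0)                   # center each feature
    # SVD of the centered data: rows of Vt are the PC loading vectors
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s**2 / (X.shape[0] - 1)             # variance explained by each PC
    theta = var / var.sum()                   # weight of each PC (proportion of total variance)
    # w(j) = sum over the first k PCs of theta_i * |alpha_ij|
    return (theta[:k, None] * np.abs(Vt[:k, :])).sum(axis=0)

# toy usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
print(weighted_pc_loadings(X, k=3))
```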


Moving Range-Based Thresholding Algorithm

A moving range-based thresholding algorithm is a way to identify the significant features from the weighted PC loadings discussed above. The main idea of a moving range-based thresholding algorithm comes from the moving average control chart that has been widely used in quality control, according to Vermaat et al. (2003). A control chart provides a comprehensive graphical display for monitoring the performance of a process over time so as to keep the process within control limits, according to Woodall et al. (1999). A typical control chart comprises monitoring statistics and a control limit. When the monitoring statistics exceed (or fall below) the control limit, an alarm is generated so that proper remedial action can be taken. A moving range control chart is useful when the sample size used for process monitoring is one. Moreover, average moving range control charts perform reasonably well when the observations deviate moderately from the normal distribution, according to Woodall et al. (1999). In this problem, we consider the weighted PC loading values as the monitoring statistics. Thus, we plot these loading values on the moving range control chart and identify the significant features as those whose weighted PC loading exceeds the control limit (threshold). Given a set of weighted PC loading values for the individual features, (w(1), w(2), ..., w(N)), the threshold can be calculated as follows, according to Vermaat et al. (2003):

\tau = \bar{w} + \Phi^{-1}(1 - \alpha/2)\, \hat{\sigma}   (3)

where \Phi^{-1} is the inverse standard normal cumulative distribution function and \alpha is the Type I error rate that can be specified by the user; the range of \alpha is between 0 and 1. In typical moving range control charts, \hat{\sigma} is estimated by \overline{MR}/1.128 (1.128 being the standard d_2 control chart constant for moving ranges of two observations), where \overline{MR} is calculated as the average of the moving ranges of two successive observations:

\overline{MR} = \frac{1}{N-1} \sum_{i=1}^{N-1} |w(i+1) - w(i)|   (4)

However, in our feature selection problem, because the weighted PC loading values for the individual features are not ordered, we cannot simply use (4). To address this issue, we propose a different way of computing \overline{MR} that can properly handle a set of unordered observations. Given that there is no specific order of the w(j), they are randomly reshuffled and \overline{MR}_1, \overline{MR}_2, \ldots, \overline{MR}_N are recalculated, where N is the number of features. Therefore, we obtain a set of recalculated moving range averages, whose mean is calculated by

\overline{MR}^{*} = \frac{1}{N} \sum_{i=1}^{N} \overline{MR}_i   (5)

Finally, the threshold of the feature selection method can be obtained by the following equation:

\tau = \bar{w} + \Phi^{-1}(1 - \alpha/2)\, \frac{\overline{MR}^{*}}{1.128}   (6)

A feature is reported as significant if the corresponding weighted PC loading exceeds the threshold \tau.
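A short Python sketch of this thresholding step, under the reconstruction above (the number of reshuffles and the 1.128 moving range constant follow the convention just described; treat this as an illustration rather than the authors' exact procedure):

```python
import numpy as np
from scipy.stats import norm

def moving_range_threshold(w, alpha=0.05, rng=None):
    """Threshold for weighted PC loadings via eqs. (4)-(6)."""
    rng = rng or np.random.default_rng()
    N = len(w)
    mr = []
    for _ in range(N):                        # one reshuffle per feature
        shuffled = rng.permutation(w)
        # eq. (4): average moving range of successive observations
        mr.append(np.mean(np.abs(np.diff(shuffled))))
    mr_star = np.mean(mr)                     # eq. (5)
    # eq. (6): mean loading plus a normal-quantile multiple of MR*/1.128
    return w.mean() + norm.ppf(1 - alpha / 2) * mr_star / 1.128

# usage: features whose loading exceeds the threshold are significant
w = np.array([0.05, 0.06, 0.04, 0.90, 0.05, 0.85])
tau = moving_range_threshold(w)
print(tau, np.where(w > tau)[0])
```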

Particle Swarm Optimization

PSO can be considered a swarm-based learning scheme. In the PSO learning process, each single solution is a "bird", referred to as a particle. The individual particles fly gradually towards the positions of their own and their neighbors' best previous experiences in a huge search space, which gives PSO more opportunity to fly into desired areas and reach better solutions; therefore, PSO can discover reasonable solutions much faster. PSO defines a fitness function that evaluates the quality of every particle's position. The position with the highest fitness value in the entire swarm is called the global best (gbest); the best previous experience of each particle is called its personal best (pbest). Based on every particle's momentum and the influence of both the personal best (pbest) and global best (gbest) solutions, every particle adjusts its velocity vector at each iteration. The PSO learning formulas are as follows:

V_i(t+1) = \omega V_i(t) + c_1 r_1 (pbest_i - X_i(t)) + c_2 r_2 (gbest - X_i(t))   (7)

X_i(t+1) = X_i(t) + V_i(t+1)   (8)

where the vectors are m-dimensional, i denotes the ith particle in the population, V is the velocity vector, X is the position vector, \omega is the inertia factor, r_1 and r_2 are random numbers drawn uniformly from [0, 1], and c_1 and c_2 are the cognitive and social learning rates, respectively. These two rates control the relative influence of the memory of the particle and of its neighborhood.
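A minimal Python sketch of one velocity-and-position update per equations (7) and (8); the function name is ours, and the default parameter values follow the experimental setup reported later in the paper:

```python
import numpy as np

def pso_step(X, V, pbest, gbest, w=0.75, c1=1.5, c2=1.5, rng=None):
    """One PSO update: eq. (7) for velocity, eq. (8) for position."""
    rng = rng or np.random.default_rng()
    r1 = rng.random(X.shape)                  # cognitive random factor
    r2 = rng.random(X.shape)                  # social random factor
    V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)   # eq. (7)
    X = X + V                                                    # eq. (8)
    return X, V
```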

Auto-PSO Clustering Algorithm

Suppose that we are given a data set of N data points {X_1, X_2, ..., X_N}, where X_i is the ith data point in n-dimensional space. The detailed process of the AUTO-PSO clustering algorithm is described below. The initial population P = [P_1, P_2, ..., P_{pop\_size}] is made up of pop_size possible particles (i.e., solutions), and each particle is a sequence of real numbers representing the T candidate cluster centers together with their selection thresholds. In an n-dimensional space, the length of a particle is therefore (T × n) + T, and the particle is encoded as

P_p = [T_{p,1}, \ldots, T_{p,T},\ m_{p,1}, \ldots, m_{p,T}],\quad m_{p,j} = (m_{p,j,1}, \ldots, m_{p,j,n}),\quad p \in \{1, 2, \ldots, pop\_size\}   (9)

where T is a user-defined positive integer denoting the maximum cluster number in the cluster set generated by the pth particle, and pop_size denotes the population size. Note that the selected cluster number will lie between 2 and T. Each T_{p,j} ∈ [0, 1] is the selection threshold value for the associated jth candidate cluster center. The proposed cluster center selection rule uses these thresholds to determine the active cluster centers in the initial population; it is defined by

IF T_{p,j} > 0.5 THEN the jth candidate cluster center is ACTIVE
ELSE the jth candidate cluster center is INACTIVE   (10)
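A small Python sketch of decoding a particle per equation (9) and applying rule (10); the flat-vector layout (thresholds first, then centers) is our own reading of the encoding, and the later second-stage sketch reuses these helpers:

```python
import numpy as np

def decode_particle(p, T, n):
    """Split a flat particle into T thresholds and T candidate centers (eq. 9)."""
    thresholds = p[:T]
    centers = p[T:].reshape(T, n)
    return thresholds, centers

def active_centers(p, T, n):
    """Rule (10): keep only centers whose threshold exceeds 0.5."""
    thresholds, centers = decode_particle(p, T, n)
    return centers[thresholds > 0.5]
```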

The fitness of a particle is computed with the CS measure, which is defined as

CS(K) = \frac{\frac{1}{K}\sum_{i=1}^{K}\left[\frac{1}{|C_i|}\sum_{X_j \in C_i}\max_{X_q \in C_i} d(X_j, X_q)\right]}{\frac{1}{K}\sum_{i=1}^{K}\left[\min_{j \in \{1,\ldots,K\},\, j \neq i} d(m_i, m_j)\right]}   (11)

with the cluster centers

m_i = \frac{1}{|C_i|}\sum_{X_j \in C_i} X_j,\quad i = 1, 2, \ldots, K   (12)

which simplifies to

CS(K) = \frac{\sum_{i=1}^{K}\left[\frac{1}{|C_i|}\sum_{X_j \in C_i}\max_{X_q \in C_i} d(X_j, X_q)\right]}{\sum_{i=1}^{K}\left[\min_{j \in \{1,\ldots,K\},\, j \neq i} d(m_i, m_j)\right]}   (13)


where m_i is the cluster center of C_i, C_i is the set whose elements are the data points assigned to the ith cluster, |C_i| is the number of elements in C_i, and d denotes a distance function. This measure is a function of the ratio of the sum of within-cluster scatter to between-cluster separation. The objective of the PSO is to minimize the CS measure in order to achieve proper clustering results. The fitness function for each individual particle is computed by

F_i = \frac{1}{CS_i(K) + eps}   (14)

where CS_i is the CS measure computed for the ith particle, and eps is a very small-valued constant.
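A small Python sketch of the CS measure and the fitness of equation (14), assuming Euclidean distance and at least two clusters; the function and variable names are ours:

```python
import numpy as np

def cs_measure(X, labels):
    """CS validity index (eqs. 11-13); assumes at least two clusters."""
    ks = np.unique(labels)
    centers = np.array([X[labels == k].mean(axis=0) for k in ks])   # eq. (12)
    scatter = 0.0
    for k in ks:
        pts = X[labels == k]
        # for each point, distance to the farthest point in its own cluster
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        scatter += d.max(axis=1).mean()
    sep = sum(
        np.linalg.norm(np.delete(centers, i, axis=0) - centers[i], axis=1).min()
        for i in range(len(centers))
    )
    return scatter / sep                      # eq. (13)

def fitness(X, labels, eps=1e-6):
    return 1.0 / (cs_measure(X, labels) + eps)   # eq. (14)
```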

Generally, the number of cluster centers is known a priori, and a traditional clustering algorithm is then applied to verify the correctness of the cluster validity measure. Here, candidate solutions are generated by the population-based evolutionary PSO learning process, so situations can arise that the CS measure formula does not handle, such as an active cluster to which fewer than two data points are assigned. To avoid such a condition, we first count the data points assigned to each active cluster; if any count is smaller than 2, the cluster center positions of that particle are updated using the concept of average computation.
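The paper does not spell out this average computation in detail; one plausible reading, sketched below under that assumption (ours, not necessarily the authors'), is to move an under-populated center to the mean of the data:

```python
import numpy as np

def refine_centers(X, centers, labels):
    """If an active cluster holds fewer than 2 points, replace its center with
    the mean of the data set -- a hypothetical reading of the paper's
    'average computation' repair step."""
    centers = centers.copy()
    for i in range(len(centers)):
        if np.sum(labels == i) < 2:
            centers[i] = X.mean(axis=0)       # hypothetical repair step
    return centers
```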

PROPOSED ACDRDPSO ALGORITHM (AUTOMATIC CLUSTERING ON DIMENSIONAL REDUCED DATA USING PSO)
We first need to introduce the basic principles. The proposed algorithm works in two stages:

1. First stage: Since we use dimensionality reduction as data preprocessing, the first stage is the dimensionality reduction technique, namely weighted PCs with the moving range-based thresholding algorithm.

2. Second stage: In this stage, the AUTO-PSO clustering algorithm is used for clustering the reduced datasets.

The pseudocode for the complete ACDRDPSO algorithm is given below; a code sketch of the first stage follows the list.

Pseudocode for First Stage

1. Apply PCA, which extracts a lower dimensional feature set that can explain most of the variability within the original data. The extracted features Y_i are each a linear combination of the original features with loading values \alpha_{ij} (i, j = 1, 2, ..., p), and can be represented as in equation (1). The loading values represent the importance of each feature in the formation of a PC.

2. Determine the number of PCs to retain (= k) using a scree plot.

3. Compute the importance of the jth feature from the first k PCs as the weighted PC loading w(j) of equation (2), where k is the number of PCs of interest and \theta_i, the weight of the ith PC, is the proportion of total variance explained by the ith PC.

4. Given the set of weighted PC loading values for the individual features, calculate the threshold using equation (6), where \overline{MR}^{*} is calculated using equation (5).

5. Identify the significant features as those whose weighted PC loading exceeds the control limit (threshold). The significant features make up the reduced data set.
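Putting the first stage together, a minimal end-to-end sketch reusing the two helpers from the earlier sketches (weighted_pc_loadings and moving_range_threshold are our illustrative names, not the paper's code):

```python
def reduce_dataset(X, k, alpha=0.05):
    """First stage: keep only features whose weighted PC loading
    exceeds the moving-range threshold."""
    w = weighted_pc_loadings(X, k)            # eqs. (1)-(2)
    tau = moving_range_threshold(w, alpha)    # eqs. (4)-(6)
    significant = w > tau
    return X[:, significant], significant
```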


At the end of this stage we get the reduced data set, which is the input to the second stage of the proposed algorithm.

Pseudocode for Second Stage

1. Set the number of particles (pop_size), the maximum number of clusters T, and the constants for the PSO algorithm. Randomly initialize each position vector X_p = [T_{p,1}, ..., T_{p,T}, m_{p,1}, ..., m_{p,T}], where the selection thresholds T_{p,j} ∈ [0, 1] (j = 1, 2, ..., T) are generated with a random function, and each candidate center m_{p,j} = (m_{p,j,1}, ..., m_{p,j,n}) is generated randomly within [min_i, max_i], where min_i and max_i are the minimum and maximum values of the ith dimension of the given data set, respectively (i = 1, 2, ..., n). Each velocity vector V_p is initialized randomly in the same manner, scaled by a small positive number.

2. Select the active solution (i.e., the active cluster centers) of every particle from the initial population with rule (10).

3. Determine unreasonable solutions and refine them by the simple average computation.

4. Evaluate the fitness function (14) for every individual particle.

5. Compare every particle's fitness value with that of its previous personal best solution, and keep as the new personal best pbest_p(t+1) whichever has the higher fitness value.

6. Compare the fitness value of every particle's best solution with the previous overall best solution gbest(t), and set gbest(t+1) to the current particle's solution if it is better.

7. Update the velocities and positions of every particle by equations (7) and (8).

8. Repeat steps 2 to 7 until the predefined number of iterations is completed.

9. Determine the optimal cluster number, the real cluster centers, and the final classification result.

A compact sketch of this loop is given below.
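The following condensed Python sketch of the second-stage loop reuses decode_particle, fitness, and pso_step from the earlier sketches; the nearest-center assignment of points to active centers and the fallback to the two highest-threshold centers are our assumptions, not the paper's stated details:

```python
import numpy as np

def auto_pso_cluster(X, T=10, pop_size=20, iters=100, w=0.75, c1=1.5, c2=1.5):
    """Second stage: AUTO-PSO clustering on the (reduced) data set X."""
    rng = np.random.default_rng()
    N, n = X.shape
    dim = T + T * n                           # eq. (9): T thresholds + T centers
    lo = np.concatenate([np.zeros(T), np.tile(X.min(axis=0), T)])
    hi = np.concatenate([np.ones(T), np.tile(X.max(axis=0), T)])
    P = rng.uniform(lo, hi, size=(pop_size, dim))        # step 1: positions
    V = rng.uniform(-1, 1, size=(pop_size, dim)) * 0.1   # step 1: velocities
    pbest, pbest_fit = P.copy(), np.full(pop_size, -np.inf)

    def evaluate(p):
        thr, centers = decode_particle(p, T, n)
        active = centers[thr > 0.5]           # step 2: rule (10)
        if len(active) < 2:                   # keep at least two clusters
            active = centers[np.argsort(thr)[-2:]]
        labels = np.argmin(
            np.linalg.norm(X[:, None, :] - active[None, :, :], axis=-1), axis=1)
        if len(np.unique(labels)) < 2:
            return -np.inf                    # degenerate partition
        return fitness(X, labels)             # step 4: eq. (14)

    for _ in range(iters):
        for i in range(pop_size):             # steps 5-6: pbest / gbest updates
            f = evaluate(P[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = P[i].copy(), f
        gbest = pbest[np.argmax(pbest_fit)]
        P, V = pso_step(P, V, pbest, gbest, w, c1, c2, rng)  # step 7: eqs. (7)-(8)
    return gbest                              # step 9: best particle found
```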

EXPERIMENT AND RESULTS


We run the dimensionality reduction technique to reduce each dataset to a lower dimensional dataset, and apply clustering on the data before and after the dimensionality reduction to verify and compare the results. The accuracy of the clustering results and the runtime of the algorithms are compared. The runtime for the dimensionality reduction based AUTO-PSO algorithm includes both the dimensionality reduction time and the AUTO-PSO clustering time.

Experimental Setup

The parameters of the AUTO-PSO algorithm for all examples are defined as pop_size = 20, acceleration coefficient constants c1 = c2 = 1.5, an inertia weight initially set to \omega = 0.75, and \lambda = 20.


Datasets Used

The following real-life data sets, taken from the UCI Machine Learning Repository, are used in this paper: the Iris plants database, Glass, the Wisconsin breast cancer data set, Wine, the Vowel data set, the Pima Diabetes data set, and the Haberman's Survival data set.

Population Initialization

For the AUTO-PSO algorithm, we randomly initialize the activation thresholds (control genes) within [0, 1]. The cluster centroids are also randomly fixed between X_max and X_min, which denote the maximum and minimum numerical values of any feature of the data set under test, respectively. To make the comparison fair, the populations for both the AUTO-PSO clustering algorithm and the proposed ACDRDPSO algorithm (for all problems tested) were initialized using the same random seeds.

Simulation Strategy

While comparing the performance of our proposed ACDRDPSO algorithm with the AUTO-PSO clustering technique, we focus on two major issues: 1) the ability to find the optimal number of clusters; and 2) the computational time required to find the solution. For comparing the speed of the algorithms, the first thing we require is a fair time measurement. The number of iterations or generations cannot be accepted as a time measure, since the algorithms perform different amounts of work in their inner loops and have different population sizes. Hence, we choose the number of fitness function evaluations (FEs) as a measure of computation time instead of generations or iterations. Since the algorithms are stochastic in nature, the results of two successive runs usually do not match; hence, we have taken 30 independent runs (with different seeds of the random number generator) of each algorithm. The results are stated in terms of the mean values and standard deviations over the 30 runs in each case. Finally, we point out that all the experiment codes are implemented in MATLAB, and the experiments are conducted on a Pentium 4 desktop with 1 GB memory in a Windows XP 2002 environment.

Experimental Results

To judge the accuracy of AUTO-PSO and ACDRDPSO, we let each of them run for a very long time over every benchmark data set, until the number of FEs exceeded 10^6. The numbers of clusters found are noted in Table 1.

Table 1: Final Solution (Mean and Standard Deviation over 30 Independent Runs) after each Algorithm was Terminated after Running for 10^4 FEs, with the CS-Measure-Based Fitness Function
Data Set              | Algorithm | Average Number of Clusters Found | Actual Number of Clusters
Wine                  | AUTO-PSO  | 3.0333 ± 0.413                   | 3
                      | ACDRDPSO  | 3.0333 ± 0.7184                  |
Iris                  | AUTO-PSO  | 3.0000 ± 0.5872                  | 3
                      | ACDRDPSO  | 3.5667 ± 0.8171                  |
Breast cancer         | AUTO-PSO  | 2.1333 ± 0.3457                  | 2
                      | ACDRDPSO  | 2.0000 ± 0                       |
Pima Diabetes         | AUTO-PSO  | 2.200 ± 0.5509                   | 2
                      | ACDRDPSO  | 2.0000 ± 0                       |
Haberman's Survival   | AUTO-PSO  | 2.333 ± 0.4795                   | 2
                      | ACDRDPSO  | 2.30 ± 0.4661                    |
Glass                 | AUTO-PSO  | 5.9667 ± 0.7649                  | 6
                      | ACDRDPSO  | 6.00 ± 0.7428                    |
Vowel                 | AUTO-PSO  | 5.9667 ± 0.7184                  | 6
                      | ACDRDPSO  | 6.00 ± 0.5872                    |


To compare the speeds of the different algorithms, we selected a threshold value of the CS measure for each of the data sets. This cutoff CS value is somewhat larger than the minimum CS value found by each algorithm in Table 2. We run each clustering algorithm on each data set and stop as soon as the algorithm achieves the proper number of clusters as well as the CS cutoff value. We then note the number of fitness FEs that the algorithm takes to yield the cutoff CS value. A lower number of FEs corresponds to a faster algorithm.

Table 2: Mean and Standard Deviations of the Number of Fitness FEs (over 30 Independent Runs) Required by each Algorithm to Reach a Predefined Cutoff Value of the CS Validity Index
Data Set              | Algorithm | Mean Number of FEs Required | CS Cutoff Value
Wine                  | AUTO-PSO  | 608 ± 56.3028               | 1.90
                      | ACDRDPSO  | 129.80 ± 98.6012            |
Iris                  | AUTO-PSO  | 598.55 ± 55.4128            | 0.95
                      | ACDRDPSO  | 110.40 ± 66.2034            |
Breast cancer         | AUTO-PSO  | 197.80 ± 35.6258            | 2.50
                      | ACDRDPSO  | 38 ± 10.6536                |
Pima Diabetes         | AUTO-PSO  | 626.40 ± 91.2266            | 1.80
                      | ACDRDPSO  | 65.20 ± 12.5579             |
Haberman's Survival   | AUTO-PSO  | 417.40 ± 382.4223           | 1.10
                      | ACDRDPSO  | 33 ± 11.5109                |
Glass                 | AUTO-PSO  | 2628 ± 317.0489             | 1.00
                      | ACDRDPSO  | 873.60 ± 41.5126            |
Vowel                 | AUTO-PSO  | 1950.80 ± 66.6911           | 0.90
                      | ACDRDPSO  | 910.60 ± 68.0206            |

DISCUSSIONS ON RESULTS
Table 2 reveals that the ACDRDPSO algorithm is faster than AUTO-PSO; note that ACDRDPSO runs on the reduced data, whereas AUTO-PSO runs on the original high dimensional data.

CONCLUSIONS AND FUTURE SCOPE


We have presented a new method of clustering high-dimensional datasets. The proposed method combines the PCA technique and a moving range-based thresholding algorithm with the automatic clustering PSO. We first obtain the weighted PC, calculated as the weighted sum of the first k PCs of interest; each of the k loading values in the weighted PC reflects the contribution of an individual feature. To identify the significant features, we use a moving range-based thresholding algorithm: features are considered significant if the corresponding weighted PC loadings exceed the threshold it produces. Using this technique we obtain a reduced version of the original high dimensional data set, and on that reduced data we apply the automatic clustering PSO to obtain the clusters. Our experimental results with real datasets demonstrate that the proposed method can successfully find the true clusters of a high dimensional data set with less computation. Our study extends the application scope of the ACDRDPSO algorithm. We hope that the procedure discussed here stimulates further investigation into the development of better procedures for clustering high dimensional datasets.

REFERENCES
1. C. Ding, X. He, H. Zha, and H. D. Simon (2002), "Adaptive dimension reduction for clustering high dimensional data," Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan, pp. 147-154.

2. Ching-Yi Chen, Hsuan-Ming Feng and Fun Ye (2006), "Automatic Particle Swarm Optimization Clustering Algorithm," International Journal of Electrical Engineering, Vol. 13, No. 4, pp. 379-387.

3. D. W. van der Merwe and A. P. Engelbrecht (2003), "Data clustering using particle swarm optimization," in Proceedings of the 2003 Congress on Evolutionary Computation, 8-12 Dec. 2003, Canberra, ACT, Australia, pp. 215-220.

4. Jolliffe, I. T. (2002), Principal Component Analysis, Springer-Verlag, New York.

5. M. G. H. Omran, A. Salman, and A. P. Engelbrecht (2006), "Dynamic clustering using particle swarm optimization with application in image segmentation," Pattern Analysis and Applications, Vol. 8, pp. 332-344.

6. Seoung Bum Kim and Panaya Rattakorn (2010), "Unsupervised Feature Selection Using Weighted Principal Components."

7. Vermaat, M. B., Ion, R. A., Does, R. J. M. M., and Klaassen, C. A. J. (2003), "A comparison of Shewhart individuals control charts based on normal, non-parametric, and extreme-value theory," Quality and Reliability Engineering International, Vol. 19, pp. 337-353.

8. Woodall, W. H. and Montgomery, D. C. (1999), "Research issues and ideas in statistical process control," Journal of Quality Technology, Vol. 31, pp. 376-386.

9. X. Cui, T. E. Potok, and P. Palathingal (2005), "Document clustering using particle swarm optimization," IEEE Swarm Intelligence Symposium, Pasadena, CA, USA, pp. 185-191.
