Parallel K-Means Clustering Algorithm on DNA Dataset

2004
Fazilah Othman, Rosni Abdullah, Nur'Aini Abdul Rashid, and Rosalina Abdul Salam

School of Computer Science, Universiti Sains Malaysia
{fazot, rosni, nuraini, rosalina}@f

Abstract. Clustering is the division of data into groups of similar objects. K-means has been used in much clustering work because of the simplicity of the algorithm. Our main effort is to parallelize the K-means clustering algorithm. The parallel version exploits the inherent parallelism in the Distance Calculation and Centroid Update phases. The parallel K-means algorithm is designed so that each of the P participating nodes is responsible for handling n/P data points. We ran the program on a Linux cluster with a maximum of eight nodes using the message-passing programming model. We examined the performance based on the percentage of correct answers and on the speed-up. The outcome shows that our parallel K-means program performs relatively well on large datasets.

1 Introduction

The objective of this work is to partition data into groups of similar items. Given a set of uninterpreted data and a set of representatives, our work groups the data according to the nearest representative. This is useful in helping scientists explore new data and may lead them to new discoveries about relationships between data. Clustering is widely employed in disciplines that involve grouping massive data, such as computational biology, botany, medicine, astronomy, marketing and image processing. Surveys [1][2][3] on clustering algorithms report that K-means is a popular, effective and practically feasible method widely applied by scientists. However, the rapid growth of data increases the processing time because of the large amount of computation. The authors of [2] implemented the K-means algorithm on DNA data using positional weight matrix (PWM) training. The decreasing prices of personal computers make parallel implementation a practical approach.
In this paper, we propose a parallel implementation of the K-means clustering algorithm on a cluster of personal computers. This is to provide a practical and economically feasible solution.

2 K-Means Clustering Algorithm

The K-means algorithm works conveniently with numerical values and offers clear geometric representations. The basic K-means algorithm requires time proportional to the number of patterns and the number of clusters per iteration. This is computationally expensive, especially for large datasets [4]. To address these problems, parallelization has become a popular alternative which exploits the inherent data parallelism within the sequential K-means algorithm. Efforts to parallelize the K-means algorithm have been made by [1][5][6][7][8] in areas such as image processing, medicine, astronomy, marketing and biology. As our contribution, we propose a parallel K-means clustering algorithm for DNA datasets running on a cluster of personal computers.

K.-M. Liew et al. (Eds.): PDCAT 2004, LNCS 3320, pp. 248-251, 2004. © Springer-Verlag Berlin Heidelberg 2004

3 Parallel K-Means Clustering Algorithm

The sequential algorithm spends much of its time calculating new centroids and calculating the distances between the n data points and the k centroids. We can cut down the execution time by parallelizing these two operations. Our parallel K-means algorithm is based on the inherent data parallelism, especially in the Distance Calculation and Centroid Update operations. The Distance Calculation operation can be executed asynchronously and in parallel for each data point (x_i for 1 ≤ i ≤ n). We designed the parallel program so that each of the P participating processors is responsible for handling n/P data points. The basic idea is to divide the n data points into P parts of approximately equal size, each of which is processed by one of the P independent nodes.
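The partitioning and Distance Calculation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the nodes are simulated as list slices inside one process instead of message-passing ranks, and the helper names (partition, nearest_centroid, assign_chunk), the toy 2-D points and the starting centroids are all hypothetical. The sizes n = 20, k = 2, P = 4 are the ones used for illustration in Figure 1.

```python
import math


def partition(points, num_nodes):
    """Split points into num_nodes contiguous chunks of near-equal size (about n/P each)."""
    base, extra = divmod(len(points), num_nodes)
    chunks, start = [], 0
    for rank in range(num_nodes):
        # The first `extra` nodes take one additional point when P does not divide n.
        size = base + (1 if rank < extra else 0)
        chunks.append(points[start:start + size])
        start += size
    return chunks


def nearest_centroid(point, centroids):
    """Return the index of the centroid with the smallest Euclidean distance to point."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def assign_chunk(chunk, centroids):
    """Distance Calculation for one node's chunk: one cluster label per data point."""
    return [nearest_centroid(p, centroids) for p in chunk]


# Hypothetical toy data (not from the paper): n = 20 points, k = 2 centroids, P = 4 nodes.
points = [(float(i), float(i % 3)) for i in range(20)]
centroids = [(2.0, 1.0), (15.0, 1.0)]

chunks = partition(points, 4)
# Each chunk is labelled using only its own points and the shared centroids.
labels_per_node = [assign_chunk(chunk, centroids) for chunk in chunks]
```

Because each chunk is labelled using only the shared centroids, the per-chunk calls do not depend on one another, which is what allows the real program to run them on separate nodes.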
Each of the P nodes must nevertheless update and store the mean and the k latest centroids in its local cache. The master node accumulates the newly assigned data points from each worker node and broadcasts the new global mean to all nodes. The k centroids allow each node to perform the Distance Calculation operation in parallel, while the global mean permits each node to decide on the convergence condition independently.

The Centroid Update is performed in parallel, before the next iteration of K-means begins. New centroids are recomputed from the data points newly assigned to each of the k centroids. Every node that performs the Centroid Update needs to communicate simultaneously, since the computation requires the global mean accumulated by the master node. The parallel K-means algorithm design is shown in Figure 1.

4 Implementation and Result

We ran the program on the Aurora Linux Cluster with a maximum of 8 nodes using the message-passing programming model. Each node has a CPU speed of 1396 MHz. The clients have 500 MB of swap memory and 1 GB of total memory, and the server has 520 MB of swap memory and 2.0 GB of total memory. To benchmark our cluster results, we tested on three datasets which have been statistically analyzed and published by P. Chaudhuri and S. Das [9]. The three datasets are ribosomal RNA for twenty-four organisms, vertebrate mitochondrial DNA sequences and complete genomes of roundworm. Next, we are interested in studying the impact of parallelizing the sequential K-means algorithm in terms of performance. We therefore assume that our cluster result is acceptable. Figure 2(a) shows the result of executing the parallel K-means algorithm on ribosomal RNA sequences of 24 organisms, and Figure 2(b) shows the result of executing the parallel K-means algorithm on an artificial dataset of 15.7 MB, which consists of 16 sequences, each 1 million base pairs long. We examined the performance based on the percentage of correct answers and highlight the speed-up.
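The text does not state how speed-up is computed. A common definition, assumed here, is S(P) = T(1) / T(P), the running time on one node divided by the running time on P nodes, with efficiency E(P) = S(P) / P. The short Python sketch below applies that definition; the function names and the timings are hypothetical and are not measurements from this work.

```python
def speedup(t_serial, t_parallel):
    """Speed-up S(P) = T(1) / T(P) for running times given in the same unit."""
    if t_parallel <= 0:
        raise ValueError("parallel running time must be positive")
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_nodes):
    """Parallel efficiency E(P) = S(P) / P; 1.0 would mean ideal linear speed-up."""
    return speedup(t_serial, t_parallel) / num_nodes


# Hypothetical wall-clock times in seconds, keyed by node count (NOT from the paper).
timings = {1: 120.0, 2: 75.0, 4: 48.0, 8: 40.0}

t1 = timings[1]
for nodes in sorted(timings):
    s = speedup(t1, timings[nodes])
    e = efficiency(t1, timings[nodes], nodes)
    print(f"P={nodes}: speed-up {s:.3f}, efficiency {e:.3f}")
```

Under this definition, a speed-up curve that flattens as nodes are added corresponds to falling efficiency, which is the usual sign that communication cost is growing relative to computation.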
The outcome shows that our parallel K-means algorithm performs relatively well on large datasets.

(1) Select k initial centroids. Each of the P participating nodes keeps a local copy of the k centroids and the n data points. For ease of explanation, assume n = 20, k = 2, P = 4.
(2) Divide n/P. Each node needs to handle only n/P data points.
    For each node:
    (1) Calculate the Euclidean distance between each of its n/P data points and the k centroids.
    (2) Find the closest centroid to each data point. Sum up the data points and count the number of assigned data points.
    (3) Calculate the local mean.
    Communicate synchronously.
    Master node: (1) sums up the number of assigned data points; (2) assembles the means of the newly assigned data points from the P-1 nodes. It sums up and broadcasts the global mean to all nodes.
    Perform the Centroid Update operation.
    Compare the new k centroids with the old k centroids. Decide on the convergence condition.
(3) Each of the P nodes repeats the flow until it meets the convergence condition.

Fig. 1. The diagram of Parallel K-means Algorithm

[Figure 2: two speed-up plots, (a) and (b). The plots and axis labels did not survive extraction; the legend entries k=2 and k=3 are legible. Speed-up values printed with the plots, one row per curve, with "?" marking an illegible digit:
(a) 1.608045, 2.12231, 2.429979, 2.70???5
    1.680952, 2.1941, 2.496768, 2.851471
    1.645027, 2.169152, 3.236341, 2.860?5
(b) 1.601764, 2.421373, 3.172048
    1.596238, 2.750785, 3.0?8694
    1.596691, 2.695104, 2.92887]

Fig. 2. (a) Speed-up performance for parallel K-means using ribosomal RNA sequences of 24 organisms; (b) Speed-up performance for parallel K-means using artificial dataset

5 Conclusion and Future Work

The experiments carried out showed that the parallel K-means algorithm starts making progress on a large dataset.
In order to improve the accuracy of the cluster results, we observed that attention should be given to the data training phase. In our program, we applied the PWM method, in which we calculated the frequency of the nucleotides A, T, C and G at each position in the sequences. However, a DNA sequence is very rich in gene information, and the arrangement of the nucleotides gives crucial information. It would be very interesting to employ another method, called the distribution of DNA words, which focuses on a word frequency-based approach as reported in [9]. We hope to port our work to a high-performance cluster of Sun machines. The Sun Cluster is a new facility provided by the Parallel and Distributed Computing Centre, School of Computer Science, USM. With 2 GB of memory on the server machine alone and a total external hard disk storage of 70 GB on the cluster machines, it is hoped that it will produce more encouraging results.

References

1. Inderjit S. Dhillon and Dharmendra S. Modha, "A Data-Clustering Algorithm on Distributed Memory Multiprocessors", ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems (KDD 99), August 1999.
2. Xiufeng Wan, Susan M. Bridges, John Boyle and Alan Boyle, "Interactive Clustering for Exploration of Genomic Data", Mississippi State University, Mississippi State, MS, USA, 2002.
3. K. Alsabti, S. Ranka and V. Singh, "An Efficient K-Means Clustering Algorithm", http://www.cise.ufl.edu/~ranka/, 1997.
4. K. Murakami and T. Takagi, "Clustering and Detection of 5' Splice Sites of mRNA by K Weight Matrices Model", Pac Symp BioComputing, 1999, pp. 171-181.
5. Kantabutra S. and Couch A.L., "Parallel K-means Clustering Algorithm on NOWs", NECTEC Technical Journal, vol. 1, no. 6 (February 2002), pp. 243-248.
6. Killian Stoffel and Abdelkader Belkoniene, "Parallel k/h-means Clustering for Large Data Sets", Proceedings of the European Conference on Parallel Processing EuroPar'99, 1999.
7. Kantabutra S., Naramittakapong, C. and Kompitak, P., "Pipeline K-means Algorithm on NOWs", Proceedings of the Third International Symposium on Communication and Information Technology (ISCIT2003), Hatyai, Songkla, Thailand, 2003.
8. Forman, G. and Zhang, B., "Linear Speed-Up for a Parallel Non-Approximate Recasting of Center-Based Clustering Algorithms, including K-Means, K-Harmonic Means and EM", ACM SIGKDD Workshop on Distributed and Parallel Knowledge Discovery (KDD2000), Boston, MA, 2000.
9. Probal Chaudhuri and Sandip Das, "Statistical Analysis of Large DNA Sequences using Distribution of DNA Words", Current Science, vol. 80, no. 9 (10 May 2001), pp. 1161-1166.