J. Parallel Distrib. Comput. 66 (2006) 1546 – 1551
www.elsevier.com/locate/jpdc
A parallel computing algorithm for 16S rRNA probe design
Dianhui Zhu a , Yuriy Fofanov a , Richard C. Willson b , George E. Fox c,∗
a Department of Computer Science, University of Houston, Houston, TX 77204-3010, USA
b Department of Chemical Engineering, University of Houston, Houston, TX 77204-4004, USA
c Department of Biology and Biochemistry, University of Houston, Houston, TX 77204-5001, USA
Received 9 September 2005; accepted 14 March 2006
Available online 2 May 2006
∗ Corresponding author. Fax: +1 713 743 8351.
E-mail address: fox@uh.edu (G.E. Fox).
doi:10.1016/j.jpdc.2006.03.002
Abstract
With the continuing rapid increase in the number of available 16S ribosomal RNA (rRNA) sequences, it is a significant computational
challenge to efficiently design 16S rRNA targeted probes. In our previous work, we designed a fast software tool called ProkProbePicker (PPP)
that takes O(log N) time for a worst-case scenario search. Despite this improvement, it can still take many hours for PPP to extract probes
for all the clusters in a phylogenetic tree. Herein, a parallelized version of PPP is described. When run on 80 processors, this version of PPP
took only 67 min to extract probes, while some 87 h were needed by the sequential version of PPP. The speedup increased linearly with the number of CPUs, demonstrating the excellent scalability of the parallelized version of PPP.
© 2006 Elsevier Inc. All rights reserved.
Keywords: 16S rRNA targeted probes; Genetic affinity; Parallel algorithm
1. Introduction
As a key component of the 30S subunit in prokaryotic ribosomes, 16S ribosomal RNA (16S rRNA) is universally found
in bacteria and is easily isolated. 16S rRNA sequence comparisons have become the single most important tool in defining and determining bacterial relationships [2,4]. As a result,
it has become commonplace to identify individual bacteria or
related groups (clusters) of bacteria based on the presence of
unique subsequences in their 16S rRNA. These subsequences
are subsequently used in conjunction with any of a large variety of experimental approaches such as microarray hybridization [8,7] and fluorescence in situ hybridization (FISH) [5,6]
to develop laboratory or field-based assays. The utility of the
16S rRNA sequence information has resulted in a minimally
curated data set of over 150,000 partial and complete bacterial
16S rRNA sequences that continues to grow at an increasing
rate. Bioinformatics tools such as PRIMROSE [1] and ARB [3]
can efficiently identify promising probes or primers for use with
a single organism or a group of related organisms. However,
emerging experimental approaches seek to simultaneously
detect numerous organisms of interest thereby requiring
the identification of large numbers of compatible probes or
primers. In our previous work, we developed a software tool,
ProkProbePicker (PPP) [9], to design probes for all the major
clusters in the phylogenetic tree (cladogram) given by the RDP
database release 7.1 (http://rdp.cme.msu.edu/index.jsp). By
employing the Karp–Rabin algorithm and the AVL tree data
structure, PPP has a time complexity of O(log N) for a worst-case scenario search. Nevertheless, it still takes tens of hours
when using a typical personal computer to identify promising
probes for large numbers of organisms or organism groupings.
To address this computational problem, we herein describe a
parallelized version of PPP and demonstrate its utility in a
cluster environment.
PPP seeks to extract probes for each cluster in a user-provided cladogram. For a target cluster, subsequences that are
present in all target sequences but are absent from all non-target
sequences (all other sequences in the cladogram) are extracted
as perfect probes. If there are no perfect probes for a particular
cluster, PPP selects those that are partially unique in the target
group. PPP has two user-defined parameters: probe length and
a cutoff value for non-perfect probes. The Karp–Rabin algorithm limits probe length to 32 residues or less. The cutoff
value specifies a minimal portion of the target sequences that
must be uniquely identified in order for a non-perfect probe to
be selected.
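Although the paper does not state the reason for the 32-residue ceiling, it is consistent with packing the four-letter nucleotide alphabet at 2 bits per residue, so that an entire probe fits in one 64-bit word for hashing. A minimal sketch of such packing (our illustration, not PPP's actual source) is:

    #include <cstdint>
    #include <string>

    // Pack a probe of <= 32 nucleotides into one 64-bit word at 2 bits/residue.
    // Returns false if the probe is too long or contains a non-ACGT(U) character.
    bool pack2bit(const std::string& probe, uint64_t& key) {
        if (probe.size() > 32) return false;       // the 32-residue limit
        key = 0;
        for (char c : probe) {
            uint64_t code;
            switch (c) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': case 'U': code = 3; break; // rRNA sequences may use U
                default:  return false;
            }
            key = (key << 2) | code;                 // append 2 bits
        }
        return true;
    }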
In order to parallelize PPP, it is natural to assign each cluster
to a processor. The task of each processor would be to identify
probes for the assigned cluster. Assigning tasks to the processors would be straightforward if all the tasks consumed similar
amounts of CPU time. Unfortunately, the number of sequences
in each cluster varies significantly and therefore individual tasks
differ dramatically in the amount of CPU time required. An
even distribution of the tasks would thus introduce significant
workload bias. For this reason, we parallelized the program
PPP by dynamically balancing workload in a task-oriented
fashion.
The parallelized version of PPP was implemented in C++
using the MPICH message-passing implementation. Unfortunately, our computational cluster environment provides MPICH 1.2.6, which does not include a client/server communication model. It was therefore necessary to implement our own client/server protocol using the existing modules in MPICH 1.2.6.
2. Method
The algorithm for the parallelized version of PPP is shown in Algorithm I. Initially, the current task, identified by the variable currentGroup in Algorithm I, is set to 1. Processor 0 is appointed as the server; it creates a thread process that handles task assignment while its main process works on tasks of its own. The server obtains a new task through synchronized access to the current task, while all client processors request tasks from the server. The current task is set to −1 when no tasks are left, and a processor exits the while loop once it receives −1 as its current task. The main process on the server side must wait until the thread process finishes, since the thread may still be waiting for requests when the main process has completed its own work. Before the main process exits, it removes the semaphore.
2.1. The client/server model
The main challenge in parallelizing PPP is that each task takes a significantly different amount of CPU time. If all processors take an equal number of tasks, severely unbalanced workloads result. To solve this problem, one processor is appointed as the server to evenly assign workloads to the other processors. All other processors are clients that request a new task from the server once they finish the current one. Since the communication model in MPICH 1.2.6 is point-to-point, without client/server communications, we designed a simple client/server model with the existing modules in MPICH 1.2.6.
In our simple client/server protocol, a client sends a request to the server by specifying its identification in the request message:

    MPI::COMM_WORLD.Send(&me, 1, MPI_INT, serverID, tag).

The server listens to all processors by using the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG:

    MPI::COMM_WORLD.Recv(&fromWhom, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG).
Upon receiving a message, the server identifies the requesting client by the variable fromWhom and sends it a new task if any tasks are left. If no more tasks are available, the server sends a special number to the client indicating that all tasks have been processed. A client ends its execution after it receives this special number.
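One round of this exchange can be sketched as follows, using the plain C MPI calls for portability (the C++ bindings shown above were later removed from the MPI standard); TAG_REQUEST, TAG_ASSIGN, and the helper names are our assumptions, not PPP's source:

    #include <mpi.h>

    const int TAG_REQUEST = 1, TAG_ASSIGN = 2;

    // Client side: announce our rank, then block until a task (or -1) arrives.
    int requestTaskFromServer(int me, int serverID) {
        int task;
        MPI_Status status;
        MPI_Send(&me, 1, MPI_INT, serverID, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&task, 1, MPI_INT, serverID, TAG_ASSIGN,
                 MPI_COMM_WORLD, &status);
        return task;
    }

    // Server side: wait for any client, then answer with a task id or -1.
    void serveOneRequest(int nextTask) {
        int fromWhom;
        MPI_Status status;
        MPI_Recv(&fromWhom, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Send(&nextTask, 1, MPI_INT, fromWhom, TAG_ASSIGN, MPI_COMM_WORLD);
    }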
To fully utilize the CPU time, in our client/server model, the
server’s duty is not restricted to assigning tasks to its clients.
It takes tasks as well. The server creates a thread to serve the
task assignment duty while its main process works on its own
tasks.
2.2. Synchronization on the access of the current task at the server side

In our design, to avoid unnecessary communication, the main process can access the current task directly, without sending a request message to the thread process. Since both the main and the thread processes on the server have the privilege of accessing the current task, unpredictable results can occur when both carry out an access simultaneously. To avoid this race condition, we introduced a semaphore to synchronize the two processes: one of them suspends its access while the other is accessing, provided both act upon the following mechanism:
(1) perform a wait on the semaphore,
(2) access the current task,
(3) update the current task,
(4) perform a signal on the semaphore.
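A minimal sketch of steps (1)–(4), rendered here with a POSIX semaphore (PPP's source is not shown; the paper's remark about "removing" the semaphore at exit may point to System V semaphores instead):

    #include <semaphore.h>

    static sem_t taskLock;
    static int currentGroup = 1;
    static int totalGroups;

    void initLock()    { sem_init(&taskLock, 0, 1); }  // binary semaphore
    void destroyLock() { sem_destroy(&taskLock); }     // "remove the semaphore"

    // Atomically fetch-and-advance the current task; -1 means no tasks left.
    int takeNextGroup() {
        sem_wait(&taskLock);                    // (1) wait on the semaphore
        int g = (currentGroup > totalGroups)    // (2) access the current task
                ? -1
                : currentGroup++;               // (3) update the current task
        sem_post(&taskLock);                    // (4) signal the semaphore
        return g;
    }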
2.3. Algorithms

Algorithm I

start MPI
currentGroup = 1                            // current task starts from group 1
if (processor == 0)                         // processor 0 will be the server
    createSemaphore()                       // to synchronize the access of currentGroup
    createThread()                          // the thread will execute function assignTask
end if
do while
    if (processor == 0)                     // processor 0 can get a task directly
        enterCriticalSection()              // synchronize the access of currentGroup
        if (currentGroup > totalGroups)     // no more tasks
            group2Work = -1
        else
            group2Work = currentGroup
            currentGroup++
        end if
        exitCriticalSection()
    else                                    // clients request a new task from the server
        group2Work = requestTaskFromServer()
    end if
    if (group2Work == -1)                   // no tasks left
        exit while loop
    end if
    extractProbes()                         // work on the assigned cluster
end do
if (processor == 0)
    wait until the thread is finished       // the thread may still be serving requests
    destroy the semaphore
end if
end MPI
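Algorithm I maps onto an MPI program skeleton along these lines (a sketch under the same assumptions as the snippets in Sections 2.1 and 2.2; extractProbes and the helpers are placeholders, not PPP's actual source):

    #include <mpi.h>
    #include <pthread.h>

    int takeNextGroup();                              // semaphore-guarded, Section 2.2
    void initLock();                                  // create the semaphore
    void destroyLock();                               // remove the semaphore
    int requestTaskFromServer(int me, int serverID);  // Section 2.1
    void* assignTask(void* arg);                      // Algorithm II, below
    void extractProbes(int group);                    // probe extraction for one cluster

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int me, nProcs;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nProcs);

        pthread_t tid;
        int nClients = nProcs - 1;
        if (me == 0) {                                // processor 0 is the server
            initLock();
            pthread_create(&tid, nullptr, assignTask, &nClients);
        }

        for (;;) {
            int group = (me == 0)
                        ? takeNextGroup()                   // server: direct access
                        : requestTaskFromServer(me, 0);     // client: ask the server
            if (group == -1) break;                         // no tasks left
            extractProbes(group);                           // work on the assigned cluster
        }

        if (me == 0) {
            pthread_join(tid, nullptr);               // wait until the thread is finished
            destroyLock();                            // remove the semaphore
        }
        MPI_Finalize();
        return 0;
    }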
The thread process on the server side assigns tasks to clients by executing the function assignTask (Algorithm II). The termination of the thread is controlled by the variable numberOfClientsLeft, which is initially the total number of clients, not including the server itself. As in the main process of the server, the thread's access to the current task is synchronized. When all tasks are complete, the thread process sends a task of −1 to each requesting client and decreases numberOfClientsLeft by one, telling the requesting client that no more tasks are available and that it is time for the client to terminate, which is exactly what the client does once it receives a task of −1. The thread process terminates once numberOfClientsLeft equals 0.
Algorithm II

function assignTask
    numberOfClientsLeft = totalNumberOfClients   // not including the server itself
    do while (numberOfClientsLeft > 0)           // some client is still working
        receiveRequest()
        enterCriticalSection()                   // synchronize the access of currentGroup
        if (currentGroup > totalGroups)          // no more tasks
            group2Assign = -1
            numberOfClientsLeft--                // the requesting client will terminate
        else
            group2Assign = currentGroup
            currentGroup++
        end if
        exitCriticalSection()
        sendTask2Client()
    end do
end function assignTask
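For concreteness, a minimal C++ rendering of Algorithm II as a thread entry point is sketched below, reusing takeNextGroup() and the message tag from the earlier sketches (our names, not PPP's source). Note that having both the main process and this thread make MPI calls requires a thread-safe MPI build:

    #include <mpi.h>
    #include <pthread.h>

    const int TAG_ASSIGN = 2;     // must match the client-side sketch
    int takeNextGroup();          // synchronized fetch-and-advance, Section 2.2

    // Algorithm II: answer client requests until every client has been told
    // that no tasks remain.
    void* assignTask(void* arg) {
        int numberOfClientsLeft = *static_cast<int*>(arg);  // excludes the server
        while (numberOfClientsLeft > 0) {                   // some client still works
            int fromWhom;
            MPI_Status status;
            MPI_Recv(&fromWhom, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);              // receiveRequest()
            int group2Assign = takeNextGroup();             // -1 when tasks are gone
            if (group2Assign == -1)
                numberOfClientsLeft--;                      // that client will exit
            MPI_Send(&group2Assign, 1, MPI_INT, fromWhom, TAG_ASSIGN,
                     MPI_COMM_WORLD);                       // sendTask2Client()
        }
        return nullptr;
    }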
3. Results and discussion
The parallelized version of PPP was tested on the Atlantis cluster, an HP Itanium2-based cluster at the Texas Learning and Computation Center (TLC2) (http://www.tlc2.uh.edu).
Fig. 1. Speedup values when the parallelized version of PPP was run on different numbers of CPUs. A very strong linear relationship can be observed.
This cluster has 152 nodes and runs RedHat Enterprise Linux 3 for the ia64 architecture. Each node has two 1.3 GHz CPUs and 4 GB of memory. The probe length in our test runs was 20 nucleotides, i.e., we extracted 20-mers for the whole phylogenetic tree (cladogram) specified by RDP release 7.1, which has 7321 clusters that are assigned as tasks. The non-perfect probe cutoff value was 50%, meaning non-perfect probes were required to occur uniquely in at least half of the sequences in the cluster under consideration. The parallelized version of PPP was run on 4–80 processors in steps of 4. The original sequential version of PPP was used as a baseline for performance assessment. The speedup and performance efficiency evaluations were based on the average elapsed time of 3 replicates (Table 1).
It took approximately 87 h to run the sequential version of PPP on Atlantis, which provided the motivation for parallelizing the code. Compared to the 67 min required when 80 processors worked together, performance was sped up by a factor of 77.6 (Table 1). A very strong linear relationship between CPU number and speedup can be observed in Fig. 1, demonstrating the excellent scalability of the parallelized version. Performance efficiency values for different numbers of CPUs (Fig. 2) also support this scalability, since they remained stable. When the number of CPUs was 80, the efficiency was 97.0%. Given the efficiency value of 96.8% with 4 CPUs, the parallelized version of PPP exhibited steady performance efficiency with increasing numbers of CPUs.
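For concreteness, the two headline figures follow directly from the definitions, using the sequential and 80-CPU average elapsed times from Table 1:

$$\text{speedup} = \frac{312\,972\ \mathrm{s}}{4033\ \mathrm{s}} \approx 77.6, \qquad \text{efficiency} = \frac{77.6}{80} \approx 97.0\%.$$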
In addition, the performance of the workload balancing mechanism was evaluated. Since the parallelized version of PPP balances workloads in a task-based fashion, it is unlikely to balance workloads completely among all processors, given that tasks consume considerably different amounts of CPU time. The processor that spends the longest CPU time on its tasks forces processors that have already completed their tasks to wait. We first calculated the ratio (R) of total idle time to total computation time.
Table 1
Elapsed time on different numbers of CPUs; the corresponding speedup and efficiency are also shown

# of CPUs        Run 1 (s)   Run 2 (s)   Run 3 (s)   Average (s)   Speedup   Efficiency (%)
1 (sequential)   313 863     313 374     311 680     312 972
4                 80 819      80 876      80 922      80 872        3.9       96.8
8                 39 903      39 682      39 681      39 755        7.9       98.3
12                26 932      26 927      26 956      26 939       11.6       96.8
16                20 202      20 190      20 202      20 198       15.5       96.9
20                16 174      16 172      16 172      16 172       19.4       97.0
24                13 477      13 474      13 479      13 477       23.2       96.8
28                11 542      11 544      11 543      11 543       27.1       96.8
32                10 104      10 102      10 102      10 103       31.0       96.8
36                 8 989       8 984       8 987       8 987       34.8       96.7
40                 7 962       7 957       7 951       7 957       39.3       98.4
44                 7 278       7 313       7 361       7 317       42.8       97.2
48                 6 743       6 750       6 741       6 745       46.4       96.7
52                 6 227       6 221       6 231       6 226       50.3       96.7
56                 5 741       5 744       5 746       5 744       54.5       97.3
60                 5 403       5 407       5 400       5 403       57.9       96.5
64                 5 064       5 071       5 061       5 065       61.8       96.6
68                 4 749       4 747       4 752       4 749       65.9       96.9
72                 4 490       4 492       4 489       4 490       69.7       96.8
76                 4 296       4 284       4 290       4 290       73.0       96.9
80                 4 032       4 040       4 028       4 033       77.6       97.0
Fig. 2. Performance efficiency on different numbers of CPUs.
Denoting T_i as the CPU time consumed by the ith processor on its tasks, for 0 ≤ i < N where N is the total number of CPUs, and T_max as the longest time needed among all processors, R is calculated as

$$R = \frac{\sum_{i=0}^{N-1} \left( T_{\max} - T_i \right)}{\sum_{i=0}^{N-1} T_i}. \tag{1}$$
Essentially, the value of R reflects the overall balance of workloads; a lower R value generally indicates a better balance. We evaluated the R value in one of our experiments in which the number of CPUs was 80. Processor 20 had the longest CPU time (T_max = 4032 s) (Fig. 3). With Eq. (1), the value of R was 0.6%, which means that, compared to the total CPU time spent on tasks by all processors, 0.6% of the CPU time was idle on average.
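A direct transcription of Eq. (1) is sketched below (our helper, not part of PPP; the per-processor times would come from the run logs):

    #include <algorithm>
    #include <vector>

    // Ratio of total idle time to total computation time, per Eq. (1).
    // times[i] is the CPU time the ith processor spent on its tasks.
    double idleRatio(const std::vector<double>& times) {
        double tmax = *std::max_element(times.begin(), times.end());
        double idle = 0.0, busy = 0.0;
        for (double t : times) {
            idle += tmax - t;   // time this processor waits for the slowest one
            busy += t;          // time this processor computes
        }
        return idle / busy;
    }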
It is not entirely sufficient to evaluate the balancing mechanism by the value of R alone, since it is an overall index. Workload bias on individual processors matters as well, because a single large bias on one processor also indicates undesirable workload balancing. For this reason, we also considered the idle-to-computation time ratio on the processor with the largest workload bias. In Fig. 3, the largest bias occurred on processor 5, where the task computation time was 3986 s and the idle time was 46 s. The ratio of idle to computation time on processor 5 was thus a minimal 1.15%. These results indicate that the workload balancing algorithm used in the parallelized version of PPP performed adequately.

Fig. 3. Elapsed time on each processor when the parallel version was run on 80 CPUs in one experiment. Processor 0 had the minimum elapsed time (3986 s) while the maximum elapsed time (4032 s) occurred on processor 20.
4. Future work
Herein, we described a first-generation algorithm that can
take advantage of the resources of a cluster computing environment to design candidate 16S rRNA targeted probes for use
in bacterial identification and monitoring. The current parallel
version of PPP seeks to balance workload among all the available processors without evenly distributing data. However, as
the data sets to be analyzed continue to grow, it may become
impractical for each processor to keep a local copy of all the
data. Therefore, future versions of PPP will need to take this
into account.
Acknowledgments
This research was funded in part by grants from the Institute
of Space Systems Operations to GEF, the National Aeronautics
and Space Administration Office of Biological and Physical
Research (NNJ04HF43G) to GEF and RCW and the Texas
Learning and Computation Center (TLC2) to YF, GEF, and
RCW. We thank the system administrators at TLC2 for their technical support.
References
[1] K.E. Ashelford, A.J. Weightman, J.C. Fry, PRIMROSE: a computer
program for generating and estimating the phylogenetic range of 16S
rRNA oligonucleotide probes and primers in conjunction with the RDP-II database, Nucleic Acids Res. 30 (2002) 3481–3489.
[2] G.E. Fox, A.K. Naik, The evolutionary history of the translation
machinery, in: L. Ribas de Pouplana (Ed.), The Genetic Code
and the Origin of Life, Landes Bioscience/Eurekah.com, Kluwer
Academic/Plenum, Georgetown, Tex., New York, NY, 2004, pp. 92–105.
[3] W. Ludwig, O. Strunk, R. Westram, L. Richter, H. Meier, Yadhukumar,
A. Buchner, T. Lai, S. Steppi, G. Jobb, W. Forster, I. Brettske, S. Gerber,
A.W. Ginhart, O. Gross, S. Grumann, S. Hermann, R. Jost, A. Konig,
T. Liss, R. Lussmann, M. May, B. Nonhoff, B. Reichel, R. Strehlow, A.
Stamatakis, N. Stuckmann, A. Vilbig, M. Lenke, T. Ludwig, A. Bode,
K.H. Schleifer, ARB: a software environment for sequence data, Nucleic
Acids Res. 32 (2004) 1363–1371.
[4] G.J. Olsen, C.R. Woese, Archaeal genomics: an overview, Cell 89 (1997)
991–994.
[5] M. Stoffels, T. Castellanos, A. Hartmann, Design and application
of new 16S rRNA-targeted oligonucleotide probes for the
Azospirillum–Skermanella–Rhodocista-cluster, Syst. Appl. Microbiol. 24
(2001) 83–97.
[6] C. Urzi, V. La Cono, E. Stackebrandt, Design and application of two
oligonucleotide probes for the identification of Geodermatophilaceae
strains using fluorescence in situ hybridization (FISH), Environ.
Microbiol. 6 (2004) 678–685.
[7] R.F. Wang, M.L. Beggs, B.D. Erickson, C.E. Cerniglia, DNA microarray analysis of predominant human intestinal bacteria in fecal samples, Mol. Cell. Probes 18 (2004) 223–234.
[8] K.S. Wilson, H.F. Noller, Mapping the position of translational elongation
factor EF-G in the ribosome by directed hydroxyl radical probing, Cell
92 (1998) 131–139.
[9] D. Zhu, Y. Fofanov, R.C. Willson, G.E. Fox, ProkProbePicker (PPP): a
fast program to extract 16S rRNA-targeted probes for prokaryotes, in:
Proceedings of the 2005 International Conference on Mathematics and
Engineering Techniques in Medicine and Biological Sciences (METMBS
’05), 2005, pp. 41–47.
Dianhui Zhu is a Ph.D. candidate in the Department of Computer Science at
the University of Houston, where he received his M.Sc. degree in Computer
Science in 2003. He received his B.S. degree in Plant Protection and his
M.Sc. degree in Entomology from the Huazhong Agricultural University and
China Agricultural University in China, respectively. His main research interests include efficient algorithms for bacterial probe/primer design and the development of parallel implementations of algorithms for sequence analysis. He is also involved in computational studies of Shine–Dalgarno motif analysis and non-coding RNA recognition in bacteria.
Yuriy Fofanov received his Ph.D. degree in Physics and Mathematics from
the Kuybishev (Samara) State University, USSR, in 1988. He is currently an
Assistant Professor with the Computer Science and Biology and Biochemistry
Departments at the University of Houston, Houston, TX, USA. His main
research interests include bioinformatics.
George E. Fox received his Ph.D. in Chemical Engineering from Syracuse
University. He is currently a Professor of Biology and Biochemistry as well
as Chemical Engineering at the University of Houston. He is also a Member
of the Texas Learning and Computation Center. He was a pioneer in the use
of 5S rRNA and 16S rRNA to decipher phylogenetic relationships among
bacteria. His current research interests include RNA structure, function and
evolution as well as practical applications based on RNA information such
as the design of RNA-targeted probes for use in microbial identification
and monitoring.