J. Parallel Distrib. Comput. 66 (2006) 1546 – 1551
www.elsevier.com/locate/jpdc
A parallel computing algorithm for 16S rRNA probe design
Dianhui Zhu a , Yuriy Fofanov a , Richard C. Willson b , George E. Fox c,∗
a Department of Computer Science, University of Houston, Houston, TX 77204-3010, USA
b Department of Chemical Engineering, University of Houston, Houston, TX 77204-4004, USA
c Department of Biology and Biochemistry, University of Houston, Houston, TX 77204-5001, USA
Received 9 September 2005; accepted 14 March 2006
Available online 2 May 2006
∗ Corresponding author. Fax: +1 713 743 8351.
E-mail address: fox@uh.edu (G.E. Fox).
doi:10.1016/j.jpdc.2006.03.002
Abstract
With the continuing rapid increase in the number of available 16S ribosomal RNA (rRNA) sequences, it is a significant computational
challenge to efficiently design 16S rRNA targeted probes. In our previous work, we designed a fast software tool called ProkProbePicker (PPP)
that takes O(log N) time for a worst-case scenario search. Despite this improvement, it can still take many hours for PPP to extract probes
for all the clusters in a phylogenetic tree. Herein, a parallelized version of PPP is described. When run on 80 processors, this version of PPP
took only 67 min to extract probes, while some 87 h were needed by the sequential version of PPP. The speedup increased linearly with the number of CPUs, demonstrating the excellent scalability of the parallelized version of PPP.
© 2006 Elsevier Inc. All rights reserved.
Keywords: 16S rRNA targeted probes; Genetic affinity; Parallel algorithm
1. Introduction
As a key component of the 30S subunit in prokaryotic ribosomes, 16S ribosomal RNA (16S rRNA) is universally found
in bacteria and is easily isolated. 16S rRNA sequence comparisons have become the single most important tool in defining and determining bacterial relationships [2,4]. As a result,
it has become commonplace to identify individual bacteria or
related groups (clusters) of bacteria based on the presence of
unique subsequences in their 16S rRNA. These subsequences
are subsequently used in conjunction with any of a large variety of experimental approaches such as microarray hybridization [8,7] and fluorescence in situ hybridization (FISH) [5,6]
to develop laboratory or field-based assays. The utility of the
16S rRNA sequence information has resulted in a minimally
curated data set of over 150,000 partial and complete bacterial
16S rRNA sequences that continues to grow at an increasing
rate. Bioinformatics tools such as PRIMROSE [1] and ARB [3]
can efficiently identify promising probes or primers for use with
a single organism or a group of related organisms. However,
emerging experimental approaches seek to simultaneously
detect numerous organisms of interest thereby requiring
the identification of large numbers of compatible probes or
primers. In our previous work, we developed a software tool,
ProkProbePicker (PPP) [9], to design probes for all the major
clusters in the phylogenetic tree (cladogram) given by the RDP
database release 7.1 (http://rdp.cme.msu.edu/index.jsp). By
employing the Karp–Rabin algorithm and the AVL tree data
structure, PPP has a time complexity of O(log N) for a worst-case scenario search. Nevertheless, it still takes tens of hours
when using a typical personal computer to identify promising
probes for large numbers of organisms or organism groupings.
To address this computational problem, we herein describe a
parallelized version of PPP and demonstrate its utility in a
cluster environment.
PPP seeks to extract probes for each cluster in a user-provided cladogram. For a target cluster, subsequences that are
present in all target sequences but are absent from all non-target
sequences (all other sequences in the cladogram) are extracted
as perfect probes. If there are no perfect probes for a particular
cluster, PPP selects those that are partially unique in the target
group. PPP has two user-defined parameters: probe length and
a cutoff value for non-perfect probes. The Karp–Rabin algorithm limits probe length to 32 residues or less. The cutoff
value specifies a minimal portion of the target sequences that
must be uniquely identified in order for a non-perfect probe to
be selected.
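Although the paper does not state the reason for the 32-residue ceiling, it is consistent with packing the four-letter nucleotide alphabet at 2 bits per residue, so that an entire probe fits in one 64-bit word for hashing. A minimal sketch of such packing (our illustration, not PPP's actual source) is:

    #include <cstdint>
    #include <string>

    // Pack a probe of <= 32 nucleotides into one 64-bit word at 2 bits/residue.
    // Returns false if the probe is too long or contains a non-ACGT(U) character.
    bool pack2bit(const std::string& probe, uint64_t& key) {
        if (probe.size() > 32) return false;       // the 32-residue limit
        key = 0;
        for (char c : probe) {
            uint64_t code;
            switch (c) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': case 'U': code = 3; break; // rRNA sequences may use U
                default:  return false;
            }
            key = (key << 2) | code;                 // append 2 bits
        }
        return true;
    }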
In order to parallelize PPP, it is natural to assign each cluster
to a processor. The task of each processor would be to identify
probes for the assigned cluster. Assigning tasks to the processors would be straightforward if all the tasks consumed similar
amounts of CPU time. Unfortunately, the number of sequences
in each cluster varies significantly and therefore individual tasks
differ dramatically in the amount of CPU time required. An
even distribution of the tasks would thus introduce significant
workload bias. For this reason, we parallelized the program
PPP by dynamically balancing workload in a task-oriented
fashion.
The parallelized version of PPP was implemented in C++
using the MPICH message-passing implementation. Unfortunately, our computational cluster environment provides MPICH 1.2.6, which does not include a client/server communication model. It was therefore necessary to implement our own client/server protocol using the existing modules in MPICH 1.2.6.
2. Method
The algorithm for the parallelized version of PPP is shown in Algorithm I. Initially, the current task, identified by the variable currentGroup in Algorithm I, is set to 1. Processor 0 is appointed as the server; it creates a thread process that handles task assignment while its main process works on tasks of its own. The server obtains a new task through synchronized access to the current task, while all client processors request tasks from the server. The current task is set to −1 when no tasks are left, and a processor exits the while loop once it receives −1 as its current task. The main process on the server side must wait until the thread process finishes, since the thread may still be waiting for requests when the main process has completed its own work. Before the main process exits, it removes the semaphore.
2.1. The client/server model
The main challenge in parallelizing PPP is that each task takes a significantly different amount of CPU time. If all processors take an equal number of tasks, severely unbalanced workloads result. To solve this problem, one processor is appointed as the server to evenly assign workloads to the other processors. All other processors are clients that request a new task from the server once they finish the current one. Since the communication model in MPICH 1.2.6 is point-to-point, without client/server communications, we designed a simple client/server model with the existing modules in MPICH 1.2.6.
In our simple client/server protocol, a client sends a request to the server by specifying its identification in the request message:

    MPI::COMM_WORLD.Send(&me, 1, MPI_INT, serverID, tag).

The server listens to all processors by using the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG:

    MPI::COMM_WORLD.Recv(&fromWhom, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG).
Upon receiving a message, the server identifies the requesting client by the variable fromWhom and sends it a new task if any tasks are left. If no more tasks are available, the server sends a special number to the client indicating that all tasks have been processed. A client ends its execution after it receives this special number.
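One round of this exchange can be sketched as follows, using the plain C MPI calls for portability (the C++ bindings shown above were later removed from the MPI standard); TAG_REQUEST, TAG_ASSIGN, and the helper names are our assumptions, not PPP's source:

    #include <mpi.h>

    const int TAG_REQUEST = 1, TAG_ASSIGN = 2;

    // Client side: announce our rank, then block until a task (or -1) arrives.
    int requestTaskFromServer(int me, int serverID) {
        int task;
        MPI_Status status;
        MPI_Send(&me, 1, MPI_INT, serverID, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&task, 1, MPI_INT, serverID, TAG_ASSIGN,
                 MPI_COMM_WORLD, &status);
        return task;
    }

    // Server side: wait for any client, then answer with a task id or -1.
    void serveOneRequest(int nextTask) {
        int fromWhom;
        MPI_Status status;
        MPI_Recv(&fromWhom, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Send(&nextTask, 1, MPI_INT, fromWhom, TAG_ASSIGN, MPI_COMM_WORLD);
    }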
To fully utilize the CPU time, in our client/server model, the
server’s duty is not restricted to assigning tasks to its clients.
It takes tasks as well. The server creates a thread to serve the
task assignment duty while its main process works on its own
tasks.
2.2. Synchronization on the access of the current task at the server side

In our design, to avoid unnecessary communication, the main process can access the current task directly, without sending a request message to the thread process. Since both the main and the thread processes on the server have the privilege of accessing the current task, unpredictable results can occur when both carry out an access simultaneously. To avoid this race condition, we introduced a semaphore to synchronize the two processes: one of them suspends its access while the other is accessing, provided both act upon the following mechanism:
(1) perform a wait on the semaphore,
(2) access the current task,
(3) update the current task,
(4) perform a signal on the semaphore.
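A minimal sketch of steps (1)–(4), rendered here with a POSIX semaphore (PPP's source is not shown; the paper's remark about "removing" the semaphore at exit may point to System V semaphores instead):

    #include <semaphore.h>

    static sem_t taskLock;
    static int currentGroup = 1;
    static int totalGroups;

    void initLock()    { sem_init(&taskLock, 0, 1); }  // binary semaphore
    void destroyLock() { sem_destroy(&taskLock); }     // "remove the semaphore"

    // Atomically fetch-and-advance the current task; -1 means no tasks left.
    int takeNextGroup() {
        sem_wait(&taskLock);                    // (1) wait on the semaphore
        int g = (currentGroup > totalGroups)    // (2) access the current task
                ? -1
                : currentGroup++;               // (3) update the current task
        sem_post(&taskLock);                    // (4) signal the semaphore
        return g;
    }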
2.3. Algorithms

Algorithm I

start MPI
currentGroup = 1                            // current task starts from group 1
if (processor == 0)                         // processor 0 will be the server
    createSemaphore()                       // to synchronize the access of currentGroup
    createThread()                          // the thread will execute function assignTask
end if
do while
    if (processor == 0)                     // processor 0 can get a task directly
        enterCriticalSection()              // synchronize the access of currentGroup
        if (currentGroup > totalGroups)     // no more tasks
            group2Work = -1
        else
            group2Work = currentGroup
            currentGroup++
        end if
        exitCriticalSection()
    else                                    // clients request a new task from the server
        group2Work = requestTaskFromServer()
    end if
    if (group2Work == -1)                   // no tasks left
        exit while loop
    end if
    extractProbes()                         // work on the assigned cluster
end do
if (processor == 0)
    wait until the thread is finished       // the thread may still be serving requests
    destroy the semaphore
end if
end MPI
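Algorithm I maps onto an MPI program skeleton along these lines (a sketch under the same assumptions as the snippets in Sections 2.1 and 2.2; extractProbes and the helpers are placeholders, not PPP's actual source):

    #include <mpi.h>
    #include <pthread.h>

    int takeNextGroup();                              // semaphore-guarded, Section 2.2
    void initLock();                                  // create the semaphore
    void destroyLock();                               // remove the semaphore
    int requestTaskFromServer(int me, int serverID);  // Section 2.1
    void* assignTask(void* arg);                      // Algorithm II, below
    void extractProbes(int group);                    // probe extraction for one cluster

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int me, nProcs;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nProcs);

        pthread_t tid;
        int nClients = nProcs - 1;
        if (me == 0) {                                // processor 0 is the server
            initLock();
            pthread_create(&tid, nullptr, assignTask, &nClients);
        }

        for (;;) {
            int group = (me == 0)
                        ? takeNextGroup()                   // server: direct access
                        : requestTaskFromServer(me, 0);     // client: ask the server
            if (group == -1) break;                         // no tasks left
            extractProbes(group);                           // work on the assigned cluster
        }

        if (me == 0) {
            pthread_join(tid, nullptr);               // wait until the thread is finished
            destroyLock();                            // remove the semaphore
        }
        MPI_Finalize();
        return 0;
    }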
The thread process on the server side assigns tasks to clients by executing the function assignTask (Algorithm II). The termination of the thread is controlled by the variable numberOfClientsLeft, which is initially the total number of clients, not including the server itself. As in the main process of the server, the thread's access to the current task is synchronized. When all tasks are complete, the thread process sends a task of −1 to each requesting client and decreases numberOfClientsLeft by one, telling the requesting client that no more tasks are available and that it is time for the client to terminate, which is exactly what the client does once it receives a task of −1. The thread process terminates once numberOfClientsLeft equals 0.
Algorithm II

function assignTask
    numberOfClientsLeft = totalNumberOfClients   // not including the server itself
    do while (numberOfClientsLeft > 0)           // some client is still working
        receiveRequest()
        enterCriticalSection()                   // synchronize the access of currentGroup
        if (currentGroup > totalGroups)          // no more tasks
            group2Assign = -1
            numberOfClientsLeft--                // the requesting client will terminate
        else
            group2Assign = currentGroup
            currentGroup++
        end if
        exitCriticalSection()
        sendTask2Client()
    end do
end function assignTask
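For concreteness, a minimal C++ rendering of Algorithm II as a thread entry point is sketched below, reusing takeNextGroup() and the message tag from the earlier sketches (our names, not PPP's source). Note that having both the main process and this thread make MPI calls requires a thread-safe MPI build:

    #include <mpi.h>
    #include <pthread.h>

    const int TAG_ASSIGN = 2;     // must match the client-side sketch
    int takeNextGroup();          // synchronized fetch-and-advance, Section 2.2

    // Algorithm II: answer client requests until every client has been told
    // that no tasks remain.
    void* assignTask(void* arg) {
        int numberOfClientsLeft = *static_cast<int*>(arg);  // excludes the server
        while (numberOfClientsLeft > 0) {                   // some client still works
            int fromWhom;
            MPI_Status status;
            MPI_Recv(&fromWhom, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);              // receiveRequest()
            int group2Assign = takeNextGroup();             // -1 when tasks are gone
            if (group2Assign == -1)
                numberOfClientsLeft--;                      // that client will exit
            MPI_Send(&group2Assign, 1, MPI_INT, fromWhom, TAG_ASSIGN,
                     MPI_COMM_WORLD);                       // sendTask2Client()
        }
        return nullptr;
    }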
3. Results and discussion
The parallelized version of PPP was tested on the Atlantis cluster, an HP Itanium2-based cluster at the Texas Learning and Computation Center (TLC2) (http://www.tlc2.uh.edu).
Fig. 1. Speedup values when the parallelized version of PPP was run on different numbers of CPUs. A very strong linear relationship can be observed.
This cluster has 152 nodes and runs RedHat Enterprise Linux 3 for the ia64 architecture. Each node has two 1.3 GHz CPUs and 4 GB of memory. The probe length in our test runs was 20 nucleotides, i.e., we extracted 20-mers for the whole phylogenetic tree (cladogram) specified by RDP release 7.1, which has 7321 clusters that are assigned as tasks. The non-perfect probe cutoff value was 50%, meaning non-perfect probes were required to occur uniquely in at least half of the sequences in the cluster under consideration. The parallelized version of PPP was run on 4–80 processors in steps of 4. The original sequential version of PPP was used as a baseline for performance assessment. The speedup and performance efficiency evaluations were based on the average elapsed time of 3 replicates (Table 1).
It took approximately 87 h to run the sequential version of PPP on Atlantis, which provided the motivation for parallelizing the code. Compared to the 67 min required when 80 processors worked together, performance was sped up by a factor of 77.6 (Table 1). A very strong linear relationship between CPU number and speedup can be observed in Fig. 1, demonstrating the excellent scalability of the parallelized version. Performance efficiency values for different numbers of CPUs (Fig. 2) also support this scalability, since they remained stable. When the number of CPUs was 80, the efficiency was 97.0%. Given the efficiency value of 96.8% with 4 CPUs, the parallelized version of PPP exhibited steady performance efficiency with increasing numbers of CPUs.
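For concreteness, the two headline figures follow directly from the definitions, using the sequential and 80-CPU average elapsed times from Table 1:

$$\text{speedup} = \frac{312\,972\ \mathrm{s}}{4033\ \mathrm{s}} \approx 77.6, \qquad \text{efficiency} = \frac{77.6}{80} \approx 97.0\%.$$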
In addition, the performance of the workload balancing mechanism was evaluated. Since the parallelized version of PPP balances workloads in a task-based fashion, it is unlikely to balance workloads completely among all processors, given that tasks consume considerably different amounts of CPU time. The processor that spends the longest CPU time on its tasks forces processors that have already completed their tasks to wait. We first calculated the ratio (R) of total idle time to total computation time.
Table 1
Elapsed time on different numbers of CPUs; the corresponding speedup and efficiency are also shown

# of CPUs        Run 1 (s)   Run 2 (s)   Run 3 (s)   Average (s)   Speedup   Efficiency (%)
1 (sequential)   313 863     313 374     311 680     312 972
4                 80 819      80 876      80 922      80 872        3.9       96.8
8                 39 903      39 682      39 681      39 755        7.9       98.3
12                26 932      26 927      26 956      26 939       11.6       96.8
16                20 202      20 190      20 202      20 198       15.5       96.9
20                16 174      16 172      16 172      16 172       19.4       97.0
24                13 477      13 474      13 479      13 477       23.2       96.8
28                11 542      11 544      11 543      11 543       27.1       96.8
32                10 104      10 102      10 102      10 103       31.0       96.8
36                 8 989       8 984       8 987       8 987       34.8       96.7
40                 7 962       7 957       7 951       7 957       39.3       98.4
44                 7 278       7 313       7 361       7 317       42.8       97.2
48                 6 743       6 750       6 741       6 745       46.4       96.7
52                 6 227       6 221       6 231       6 226       50.3       96.7
56                 5 741       5 744       5 746       5 744       54.5       97.3
60                 5 403       5 407       5 400       5 403       57.9       96.5
64                 5 064       5 071       5 061       5 065       61.8       96.6
68                 4 749       4 747       4 752       4 749       65.9       96.9
72                 4 490       4 492       4 489       4 490       69.7       96.8
76                 4 296       4 284       4 290       4 290       73.0       96.9
80                 4 032       4 040       4 028       4 033       77.6       97.0
Fig. 2. Performance efficiency on different numbers of CPUs.
Denoting T_i as the CPU time consumed by the ith processor on its tasks, for 0 ≤ i < N where N is the total number of CPUs, and T_max as the longest time needed among all processors, R is calculated as

$$R = \frac{\sum_{i=0}^{N-1} \left( T_{\max} - T_i \right)}{\sum_{i=0}^{N-1} T_i}. \tag{1}$$
Essentially, the value of R reflects the overall balance of workloads; a lower R value generally indicates a better balance. We evaluated the R value in one of our experiments in which the number of CPUs was 80. Processor 20 had the longest CPU time (T_max = 4032 s) (Fig. 3). With Eq. (1), the value of R was 0.6%, which means that, compared to the total CPU time spent on tasks by all processors, 0.6% of the CPU time was idle on average.
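A direct transcription of Eq. (1) is sketched below (our helper, not part of PPP; the per-processor times would come from the run logs):

    #include <algorithm>
    #include <vector>

    // Ratio of total idle time to total computation time, per Eq. (1).
    // times[i] is the CPU time the ith processor spent on its tasks.
    double idleRatio(const std::vector<double>& times) {
        double tmax = *std::max_element(times.begin(), times.end());
        double idle = 0.0, busy = 0.0;
        for (double t : times) {
            idle += tmax - t;   // time this processor waits for the slowest one
            busy += t;          // time this processor computes
        }
        return idle / busy;
    }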
It is not entirely sufficient to evaluate the balancing mechanism by the value of R alone, since it is an overall index. Workload bias on individual processors matters as well, because a single large bias on one processor also indicates undesirable workload balancing. For this reason, we also considered the idle-to-computation time ratio on the processor with the largest workload bias. In Fig. 3, the largest bias occurred on processor 5, where the task computation time was 3986 s and the idle time was 46 s. The ratio of idle to computation time on processor 5 was thus a minimal 1.15%. These results indicate that the workload balancing algorithm used in the parallelized version of PPP performed adequately.

Fig. 3. Elapsed time on each processor when the parallel version was run on 80 CPUs in one experiment. Processor 0 had the minimum elapsed time (3986 s) while the maximum elapsed time (4032 s) occurred on processor 20.
4. Future work
Herein, we described a first-generation algorithm that can
take advantage of the resources of a cluster computing environment to design candidate 16S rRNA targeted probes for use
in bacterial identification and monitoring. The current parallel
version of PPP seeks to balance workload among all the available processors without evenly distributing data. However, as
the data sets to be analyzed continue to grow, it may become
impractical for each processor to keep a local copy of all the
data. Therefore, future versions of PPP will need to take this
into account.
Acknowledgments
This research was funded in part by grants from the Institute
of Space Systems Operations to GEF, the National Aeronautics
and Space Administration Office of Biological and Physical
Research (NNJ04HF43G) to GEF and RCW and the Texas
Learning and Computation Center (TLC2) to YF, GEF, and
RCW. We thank the system administrators at TLC2 for their technical support.
References
[1] K.E. Ashelford, A.J. Weightman, J.C. Fry, PRIMROSE: a computer
program for generating and estimating the phylogenetic range of 16S
rRNA oligonucleotide probes and primers in conjunction with the RDP-II database, Nucleic Acids Res. 30 (2002) 3481–3489.
[2] G.E. Fox, A.K. Naik, The evolutionary history of the translation
machinery, in: L. Ribas de Pouplana (Ed.), The Genetic Code
and the Origin of Life, Landes Bioscience/Eurekah.com, Kluwer
Academic/Plenum, Georgetown, Tex., New York, NY, 2004, pp. 92–105.
[3] W. Ludwig, O. Strunk, R. Westram, L. Richter, H. Meier, Yadhukumar,
A. Buchner, T. Lai, S. Steppi, G. Jobb, W. Forster, I. Brettske, S. Gerber,
A.W. Ginhart, O. Gross, S. Grumann, S. Hermann, R. Jost, A. Konig,
T. Liss, R. Lussmann, M. May, B. Nonhoff, B. Reichel, R. Strehlow, A.
Stamatakis, N. Stuckmann, A. Vilbig, M. Lenke, T. Ludwig, A. Bode,
K.H. Schleifer, ARB: a software environment for sequence data, Nucleic
Acids Res. 32 (2004) 1363–1371.
[4] G.J. Olsen, C.R. Woese, Archaeal genomics: an overview, Cell 89 (1997)
991–994.
[5] M. Stoffels, T. Castellanos, A. Hartmann, Design and application
of new 16S rRNA-targeted oligonucleotide probes for the
Azospirillum–Skermanella–Rhodocista-cluster, Syst. Appl. Microbiol. 24
(2001) 83–97.
[6] C. Urzi, V. La Cono, E. Stackebrandt, Design and application of two
oligonucleotide probes for the identification of Geodermatophilaceae
strains using fluorescence in situ hybridization (FISH), Environ.
Microbiol. 6 (2004) 678–685.
[7] R.F. Wang, M.L. Beggs, B.D. Erickson, C.E. Cerniglia, DNA microarray analysis of predominant human intestinal bacteria in fecal samples, Mol. Cell. Probes 18 (2004) 223–234.
[8] K.S. Wilson, H.F. Noller, Mapping the position of translational elongation
factor EF-G in the ribosome by directed hydroxyl radical probing, Cell
92 (1998) 131–139.
[9] D. Zhu, Y. Fofanov, R.C. Willson, G.E. Fox, ProkProbePicker (PPP): a
fast program to extract 16S rRNA-targeted probes for prokaryotes, in:
Proceedings of the 2005 International Conference on Mathematics and
Engineering Techniques in Medicine and Biological Sciences (METMBS
’05), 2005, pp. 41–47.
Dianhui Zhu is a Ph.D. candidate in the Department of Computer Science at
the University of Houston, where he received his M.Sc. degree in Computer
Science in 2003. He received his B.S. degree in Plant Protection and his
M.Sc. degree in Entomology from the Huazhong Agricultural University and
China Agricultural University in China, respectively. His main research interests include efficient algorithms for bacterial probe/primer design and the development of parallel implementations of algorithms for sequence analysis. He is also involved in computational studies of Shine–Dalgarno motif analysis and non-coding RNA recognition in bacteria.
Yuriy Fofanov received his Ph.D. degree in Physics and Mathematics from
the Kuybishev (Samara) State University, USSR, in 1988. He is currently an
Assistant Professor with the Computer Science and Biology and Biochemistry
Departments at the University of Houston, Houston, TX, USA. His main
research interests include bioinformatics.
George E. Fox received his Ph.D. in Chemical Engineering from Syracuse
University. He is currently a Professor of Biology and Biochemistry as well
as Chemical Engineering at the University of Houston. He is also a Member
of the Texas Learning and Computation Center. He was a pioneer in the use
of 5S rRNA and 16S rRNA to decipher phylogenetic relationships among
bacteria. His current research interests include RNA structure, function and
evolution as well as practical applications based on RNA information such
as the design of RNA-targeted probes for use in microbial identification
and monitoring.