Searching
5.1 INTRODUCTION
In the worst case, the procedure takes O(n) time. This is clearly optimal, since every
element of S must be examined (when x is not in S ) before declaring failure.
Alternatively, if S is sorted in nondecreasing order, then procedure BINARY
SEARCH of section 3.3.2 can return the index of an element of S equal to x (or 0 if no
such element exists) in O(log n) time. Again, this is optimal since this many bits are
needed to distinguish among the n elements of S.
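As a point of reference for the two sequential bounds above, here is a minimal Python sketch. It is not the book's procedure SEQUENTIAL SEARCH or BINARY SEARCH; the function names are illustrative, and the only convention taken from the text is that a 1-based index is returned, or 0 when x is absent.

```python
def sequential_search(S, x):
    """Scan S left to right: O(n) comparisons in the worst case."""
    for i, s in enumerate(S, start=1):
        if s == x:
            return i
    return 0


def binary_search(S, x):
    """Search a nondecreasing sequence S: O(log n) comparisons."""
    lo, hi = 0, len(S) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if S[mid] == x:
            return mid + 1          # convert to a 1-based index
        if S[mid] < x:
            lo = mid + 1            # keep the right half
        else:
            hi = mid - 1            # keep the left half
    return 0
```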
In this chapter we discuss parallel searching algorithms. We begin by consider-
ing the case where S is sorted in nondecreasing order and show how searching can be
performed on the SM SIMD model. As it turns out, our EREW searching algorithm is
no faster than procedure BINARY SEARCH. On the other hand, the CREW
algorithm matches a lower bound on the number of parallel steps required to search a
sorted sequence, assuming that all the elements of S are distinct. When this
assumption is removed, a CRCW algorithm is needed to achieve the best possible
speedup. We then turn to the more general case where the elements of S are in random
order. Here, although the SM SIMD algorithms are faster than procedure
SEQUENTIAL SEARCH, the same speedup can be achieved on a weaker model,
namely, a tree-connected SIMD computer. Finally, we present a parallel search
algorithm for a mesh-connected SIMD computer that, under some assumptions
about signal propagation time along wires, is superior to the tree algorithm.
5.2 SEARCHING A SORTED SEQUENCE
We assume throughout this section that the sequence $S = \{s_1, s_2, \ldots, s_n\}$ is sorted in
nondecreasing order, that is, $s_1 \le s_2 \le \cdots \le s_n$. Typically, a file with n records is
available, which is sorted on the s field of each record. This file is to be searched using s
as the key; that is, given an integer x, a record is sought whose s field equals x. If such a
record is found, then the information stored in the other fields may now be retrieved.
The format of a record is illustrated in Fig. 5.1. Note that if the values of the s fields are
not unique and all records whose s fields equal a given x are needed, then the search
algorithm is continued until the file is exhausted. For simplicity we begin by assuming
that the $s_i$ are distinct; this assumption is later removed.
[Figure 5.1: Format of record in file to be searched ($s_i$ field followed by OTHER INFORMATION).]
processor $P_i$ uses a variable $c_i$ that takes the value left or right according to whether
the part of the sequence $P_i$ decides to keep is to the left or right of the element it
compared to x during this stage. Initially, the value of each $c_i$ is irrelevant and can be
assigned arbitrarily. Two constants $c_0$ = right and $c_{N+1}$ = left are also used. Following
the comparison between x and an element $s_{j_i}$ of S, $P_i$ assigns a value to $c_i$ (unless
$s_{j_i} = x$, in which case the value of $c_i$ is again irrelevant). If $c_i \ne c_{i-1}$ for some i,
$1 \le i \le N$, then the sequence to be searched next runs from $s_q$ to $s_r$, where
$q = (i - 1)(N + 1)^{g-1} + 1$ and $r = i(N + 1)^{g-1} - 1$. Precisely one processor updates q
and r in the shared memory, and all remaining processors can simultaneously read the
updated values in constant time. The algorithm is given in what follows as procedure
CREW SEARCH. The procedure takes S and x as input: If $x = s_k$ for some k, then k is
returned; otherwise a 0 is returned.
procedure CREW SEARCH (S, x, k)
Step 1: {Initialize indices of sequence to be searched}
    (1.1) q ← 1
    (1.2) r ← n.
Step 2: {Initialize results and maximum number of stages}
    (2.1) k ← 0
    (2.2) g ← ⌈log(n + 1)/log(N + 1)⌉.
Step 3: while (q ≤ r and k = 0) do
    (3.1) j_0 ← q - 1
    (3.2) for i = 1 to N do in parallel
        (i) j_i ← (q - 1) + i(N + 1)^{g-1}
            {P_i compares x to s_{j_i} and determines the part of the sequence to be kept}
        (ii) if j_i ≤ r
             then if s_{j_i} = x
                  then k ← j_i
                  else if s_{j_i} > x
                       then c_i ← left
                       else c_i ← right
                       end if
                  end if
             else (a) j_i ← r + 1
                  (b) c_i ← left
             end if
            {The indices of the subsequence to be searched in the next iteration are
            computed}
        (iii) if c_i ≠ c_{i-1} then (a) q ← j_{i-1} + 1
                                    (b) r ← j_i - 1
              end if
        (iv) if (i = N and c_i ≠ c_{i+1}) then q ← j_i + 1
             end if
          end for
    (3.3) g ← g - 1.
end while.
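The procedure can be traced with a sequential simulation. The Python sketch below is one possible reading of the pseudocode, not the book's implementation: the function name `crew_search` is an assumption, the N processors are simulated by a loop, and each iteration is split into a comparison phase and an index-update phase so that every c_i is set before any neighbor reads it. The stage count g is computed with integer arithmetic to avoid floating-point rounding in the logarithms.

```python
def crew_search(S, x, N):
    """Simulate CREW SEARCH on a sorted list S of distinct keys.

    Returns the 1-based index k with S[k-1] == x, or 0 if x is absent.
    """
    n = len(S)
    q, r, k = 1, n, 0
    g = 0                                    # smallest g with (N+1)**g >= n+1
    while (N + 1) ** g < n + 1:
        g += 1
    # c[0] = right and c[N+1] = left are the two constants; c[1..N] start arbitrary.
    c = ["right"] + ["left"] * (N + 1)
    while q <= r and k == 0:
        j = [q - 1] + [0] * N                # j[0] = q - 1
        # Phase 1: every simulated processor P_i performs its comparison.
        for i in range(1, N + 1):
            j[i] = (q - 1) + i * (N + 1) ** (g - 1)
            if j[i] <= r:
                if S[j[i] - 1] == x:
                    k = j[i]                 # c[i] is left unchanged on a match
                elif S[j[i] - 1] > x:
                    c[i] = "left"
                else:
                    c[i] = "right"
            else:
                j[i] = r + 1
                c[i] = "left"
        # Phase 2: the processor that sees a change of direction updates q and r.
        new_q, new_r = q, r
        for i in range(1, N + 1):
            if c[i] != c[i - 1]:
                new_q, new_r = j[i - 1] + 1, j[i] - 1
            if i == N and c[i] != c[i + 1]:
                new_q = j[i] + 1
        q, r = new_q, new_r
        g -= 1
    return k
```

With the sequence of Example 5.1 and N = 3, `crew_search(S, 45, 3)` returns 14 after two iterations and `crew_search(S, 9, 3)` returns 4 after one.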
Analysis
Steps 1, 2, 3.1, and 3.3 are performed by one processor, say, $P_1$, in constant time. Step
3.2 also takes constant time. As proved earlier, there are at most g iterations of step 3.
It follows that procedure CREW SEARCH runs in O(log(n + 1)/log(N + 1)) time, that
is, $t(n) = O(\log_{N+1}(n + 1))$. Hence $c(n) = O(N \log_{N+1}(n + 1))$, which is not optimal.
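To see why the cost is not optimal, the stage count and the product of processors and stages can be tabulated for small cases. This is only an illustrative sketch: `stages` and `cost` are hypothetical helper names, and `cost` counts N times the number of stages, ignoring constant factors.

```python
def stages(n, N):
    """Smallest g with (N + 1)**g >= n + 1, i.e. ceil(log(n+1)/log(N+1))."""
    g = 0
    while (N + 1) ** g < n + 1:
        g += 1
    return g


def cost(n, N):
    """Processors times stages, the quantity bounded by c(n) above."""
    return N * stages(n, N)
```

For n = 15, one processor needs 4 stages (cost 4, matching binary search), three processors need 2 stages (cost 6), and fifteen processors need 1 stage (cost 15): the running time drops, but the cost rises above the sequential O(log n).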
Example 5.1
Let S = {1, 4, 6, 9, 10, 11, 13, 14, 15, 18, 20, 23, 32, 45, 51} be the sequence to be searched
using a CREW SM SIMD computer with N processors. We illustrate two successful and
one unsuccessful search.
1. Assume that N = 3 and that it is required to find the index k of the element in S
equal to 45 (i.e., x = 45). Initially, q = 1, r = 15, k = 0, and g = 2. During the first
iteration of step 3, $P_1$ computes $j_1 = 4$ and compares $s_4$ to x. Since 9 < 45,
$c_1$ = right. Simultaneously, $P_2$ and $P_3$ compare $s_8$ and $s_{12}$, respectively, to x: Since
14 < 45 and 23 < 45, $c_2$ = right and $c_3$ = right. Now $c_3 \ne c_4$; therefore q = 13 and
r remains unchanged. The new sequence to be searched runs from $s_{13}$ to $s_{15}$, as
shown in Fig. 5.4(a), and g = 1. In the second iteration, illustrated in Fig. 5.4(b), $P_1$
computes $j_1 = 12 + 1$ and compares $s_{13}$ to x: Since 32 < 45, $c_1$ = right. Simultaneously,
$P_2$ compares $s_{14}$ to x, and since they are equal, it sets k to 14 ($c_2$ remains
unchanged). Also, $P_3$ compares $s_{15}$ to x: Since 51 > 45, $c_3$ = left. Now $c_3 \ne c_2$:
Thus q = 12 + 2 + 1 = 15 and r = 12 + 3 - 1 = 14. The procedure terminates
with k = 14.
2. Say now that x = 9, with N still equal to 3. In the first iteration, $P_1$ compares $s_4$ to
x: Since they are equal, k is set to 4. All simultaneous and subsequent computations
in this iteration are redundant since the following iteration is not
performed and the procedure terminates early with k = 4.
1. Under the assumption that the elements of S are sorted and distinct, procedure
CREW SEARCH, although not cost optimal, achieves the best possible running
time for searching. This can be shown by noting that any algorithm using N
processors can compare an input element x to at most N elements of S
simultaneously. After these comparisons and the subsequent deletion of ele-
ments from S definitely not equal to x, a subsequence must be left whose length
is at least
$\lceil (n - N)/(N + 1) \rceil \ge (n - N)/(N + 1) = [(n + 1)/(N + 1)] - 1$.
After g repetitions of the same process, we are left with a sequence of length
$[(n + 1)/(N + 1)^g] - 1$. It follows that the number of iterations required by any
such parallel algorithm is no smaller than the minimum g such that
$[(n + 1)/(N + 1)^g] - 1 \le 0$,
which is
$\lceil \log(n + 1)/\log(N + 1) \rceil$.
2. Two parallel algorithms were presented in this section for searching a sequence
of length n on a CREW SM SIMD computer with N processors. The first
required O(log(n/N)) time and the second O(log(n + 1)/log(N + 1)). In both
cases, if N = n, then the algorithm runs in constant time. The fact that the
elements of S are distinct still remains a condition for achieving this constant
running time, as we shall see in the next section. However, we no longer need S
to be sorted. The algorithm is simply as follows: In one step each $P_i$, i = 1, 2,
..., n, can read x and compare it to $s_i$; if x is equal to one element of S, say, $s_k$,
then $P_k$ returns k; otherwise k remains 0.
In the previous two sections, we assumed that all the elements of the sequence S to be
searched are distinct. From our discussion so far, the reason for this assumption may
have become apparent: If each $s_i$ is not unique, then possibly more than one processor
will succeed in finding a member of S equal to x. Consequently, possibly several
processors will attempt to return a value in the variable k, thus causing a write
conflict, an occurrence disallowed in both the EREW and CREW models. Of course,
we can remove the uniqueness assumption and still use the EREW and CREW
searching algorithms described earlier. The idea is to invoke procedure STORE (see
problem 2.13) whose job is to resolve write conflicts: Thus, in O(log N) time we can get
the smallest-numbered of the successful processors to return the index k it has
computed, where $s_k = x$. The asymptotic running time of the EREW search algorithm
in section 5.2.1 is not affected by this additional overhead. However, procedure
CREW SEARCH now runs in
t(n) = O(log(n + 1)/log(N + 1)) + O(log N).
In order to appreciate the effect of this additional O(log N) term, note that when
N = n, t(n) = O(log n). In other words, procedure CREW SEARCH with n processors
is no faster than procedure BINARY SEARCH, which runs on one processor!
Clearly, in order to maintain the efficiency of procedure CREW SEARCH while
giving up the uniqueness assumption, we must run the algorithm on a CRCW SM
SIMD computer with an appropriate write conflict resolution rule. Whatever the rule
and no matter how many processors are successful in finding a member of S equal to
x, only one index k will be returned, and that in constant time.
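The effect of a write conflict resolution rule can be illustrated with a small simulation. The sketch below assumes N = n processors, so that each P_i compares one element, and models the concurrent write to k by a `resolve` function applied to all attempted values; both the function name `crcw_search` and the choice of `min` (the smallest-numbered processor wins) as the default rule are assumptions made for illustration.

```python
def crcw_search(S, x, resolve=min):
    """One simulated CRCW step with n processors and possibly repeated keys.

    Every P_i with s_i == x attempts to write its index i into k; the
    conflict resolution rule `resolve` decides which single value is stored.
    """
    attempted = [i for i, s in enumerate(S, start=1) if s == x]
    return resolve(attempted) if attempted else 0
```

Whatever rule is passed in, exactly one index comes back: with S = [25, 14, 36, 18, 15, 17, 19, 17] and x = 17, the rule `min` yields 6 and the rule `max` yields 8.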
5.3 SEARCHING A RANDOM SEQUENCE
We now turn to the more general case of the search problem. Here the elements of the
sequence $S = \{s_1, s_2, \ldots, s_n\}$ are not assumed to be in any particular order and are not
necessarily distinct. As before, we have a file with n records that is to be searched using
the s field of each record as the key. Given an integer x, a record is sought whose s field
equals x; if such a record is found, then the information stored in the other fields may
now be retrieved. This operation is referred to as querying the file. Besides querying,
search is useful in file maintenance, such as inserting a new record and updating or
deleting an existing record. Maintenance, as we shall see, is particularly easy when the
s fields are in random order.
We begin by studying parallel search algorithms for shared-memory SIMD
computers. We then show how the power of this model is not really needed for the
search problem. As it turns out, performance similar to that of SM SIMD algorithms
can be obtained using a tree-connected SIMD computer. Finally, we demonstrate that
a mesh-connected computer is superior to the tree for searching if signal propagation
time along wires is taken into account when calculating the running time of
algorithms for both models.
Analysis
We now analyze procedure SM SEARCH for each of the four incarnations of the
shared-memory model of SIMD computers.
5.3.1.2 ERCW. Steps 1 and 2 are as in the EREW case, while step 3 now
takes constant time. The overall asymptotic running time remains unchanged.
5.3.1.3 CREW. Step 1 now takes constant time, while steps 2 and 3 are as in
the EREW case. The overall asymptotic running time remains unchanged.
which is optimal.
In the case of the EREW, ERCW, and CREW models, the time to process one query
is now O(log n). For q queries, this time is simply multiplied by a factor of q. This is of
course an improvement over the time required by procedure SEQUENTIAL
SEARCH, which would be on the order of qn. For the CRCW computer, procedure
SM SEARCH now takes constant time. Thus q queries require a constant multiple of
q time units to be answered.
Surprisingly, a performance slightly inferior to that of the CRCW algorithm but
still superior to that of the EREW algorithm can be obtained using a much weaker
model, namely, the tree-connected SIMD computer. Here a binary tree with O(n)
processors processes the queries in a pipeline fashion: Thus the q queries require a
constant multiple of log n + (q - 1) time units to be answered. For large values of q
(i.e., q > log n), this behavior is equivalent to that of the CRCW algorithm. We now
turn to the description of this tree algorithm.
1. receiving one input from its parent, making two copies of it, and sending one
copy to each of its two children; and
2. receiving two inputs from its children, combining them, and passing the result to
its parent.
The next two sections illustrate how the file stored in the leaves can be queried and
maintained.
in three stages:
Stage 1: The root reads x and passes it to its two children. In turn, these send x
to their children. The process continues until a copy of x reaches each leaf.
Stage 2: Simultaneously, all leaves compare the s field of the record they store to
x: If they are equal, the leaf produces a 1 as output; otherwise a 0 is produced.
Stage 3: The outputs of the leaves are combined by going upward in the tree:
Each intermediate node computes the logical or of its two inputs (i.e., 0 or 0 = 0,
0 or 1 = 1, 1 or 0 = 1, and 1 or 1 = 1) and passes the result to its parent. The
process continues until the root receives two bits, computes their logical or, and
produces either a 1 (for yes) or a 0 (for no).
It takes O(log n) time to go down the tree, constant time to perform the comparison at
the leaves, and again O(log n) time to go back up the tree. Therefore, such a query is
answered in O(log n) time.
Example 5.2
Let S = {25, 14, 36, 18, 15, 17, 19, 17} and x = 17. The three stages above are illustrated in
Fig. 5.6.
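The three stages can be mimicked level by level in a few lines of Python. This is a sketch under stated assumptions, not the book's procedure: the name `tree_query` is illustrative, the number of leaves is taken to be a power of two, and Stage 1 (the broadcast of x) is implicit because every leaf simply reads the same x.

```python
def tree_query(leaves, x):
    """Basic query on a complete binary tree whose leaves hold the s fields.

    Returns (answer bit at the root, number of levels climbed in Stage 3).
    """
    # Stage 2: every leaf compares its s field with x and outputs one bit.
    level = [1 if s == x else 0 for s in leaves]
    levels_climbed = 0
    # Stage 3: each parent computes the logical or of its two children.
    while len(level) > 1:
        level = [level[i] | level[i + 1] for i in range(0, len(level), 2)]
        levels_climbed += 1
    return level[0], levels_climbed
```

For the data of Example 5.2, `tree_query([25, 14, 36, 18, 15, 17, 19, 17], 17)` gives a 1 at the root after climbing log 8 = 3 levels.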
Assume now that q such queries are queued waiting to be processed. They can
be pipelined down the tree since the root and intermediate nodes are free to handle the
next query as soon as they have passed the current one along to their children. The
same remark applies to the leaves: As soon as the result of one comparison has been
produced, each leaf is ready to receive a new value of x. The results are also pipelined
upward: The root and intermediate nodes can compute the logical or of the next pair
of bits as soon as the current pair has been cleared. Typically, the root and
intermediate nodes will receive data flowing downward (queries) and upward (results)
simultaneously: We assume that both can be handled in a single time unit; otherwise,
and in order to keep both flows of data moving, a processor can switch its attention
from one direction to the other alternately. It takes O(log n) time for the answer to the
first query to be produced at the root. The answer to the second query is obtained in
the following time unit. The answer to the last query emerges q - 1 time units after the
first answer. Thus the q answers are obtained in a total of O(log n) + O(q) time.
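The pipelined behavior can be checked with a clocked simulation. The sketch below makes several modeling assumptions that are not fixed by the text: one time unit per hop between levels, one time unit for the comparison at the leaves, a node handling the downward and upward flows in the same time unit, and a number of leaves that is a power of two. The name `pipelined_queries` is hypothetical.

```python
def pipelined_queries(leaves, queries):
    """Return (answers, finish) keyed by query number, one tick per hop."""
    n = len(leaves)
    d = n.bit_length() - 1            # tree depth; assumes n is a power of two
    down = [None] * (d + 1)           # down[l]: (query id, x) held at level l
    up = [None] * (d + 1)             # up[l]: (query id, bits of level-l nodes)
    pending = list(enumerate(queries))
    answers, finish = {}, {}
    t = 0
    while len(answers) < len(queries):
        t += 1
        new_up = [None] * (d + 1)
        # Results move one level up: each node ORs its two children's bits.
        for l in range(d):
            if up[l + 1] is not None:
                qid, bits = up[l + 1]
                new_up[l] = (qid, [bits[i] | bits[i + 1]
                                   for i in range(0, len(bits), 2)])
        # Leaves compare the query that reached them in the previous tick.
        if down[d] is not None:
            qid, x = down[d]
            new_up[d] = (qid, [1 if s == x else 0 for s in leaves])
        # Queries move one level down; the root reads the next waiting query.
        new_down = [pending.pop(0) if pending else None] + down[:d]
        down, up = new_down, new_up
        if up[0] is not None:         # the root has just produced an answer
            qid, bits = up[0]
            answers[qid], finish[qid] = bits[0], t
    return answers, finish
```

Under these assumptions, three queries on the eight leaves of Example 5.2 finish at ticks 8, 9, and 10: the first answer takes a number of ticks proportional to log n, and each later answer follows one tick after the previous one.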
We now examine some variations over the basic form of a query discussed so far.
1. Position If a query is successful and element $s_k$ is equal to x, it may be desired
to know the index k. Assume that the leaves are numbered 1, ..., n and that leaf i
contains $s_i$. Following the comparison with x, leaf i produces the pair (1, i) if $s_i = x$;
otherwise it produces (0, i). All intermediate nodes and the root now operate as
follows. If two pairs (1, i) and (0, j) are received, then the pair (1, i) is sent upward.
Otherwise, if both pairs have a 1 as a first element or if both pairs have a 0 as a first
element, then the pair arriving from the left child is sent upward. In this way, the root
produces either
With this modification, the root in example 5.2 would produce (1,6).
This variant of the basic query can itself be extended in three ways:
(a) When a record is found whose s field equals x, it may be desirable to obtain the
entire record as an answer to the query (or perhaps some of its fields). The
preceding approach can be generalized by having the leaf that finds a match
return a triple of the form (1, i, required information). The intermediate nodes
and root behave as before.
(b) Sometimes, the positions of all elements equal to x in S may be needed. In this
case, when an intermediate node, or the root, receives two pairs (1, i) and (1, j),
two pairs are sent upward consecutively. In this way the indices of all members
of S equal to x will eventually emerge from the root.
(c) The third extension is a combination of (a) and (b): All records whose s fields
match x are to be retrieved. This is handled by combining the preceding two
solutions.
It should be noted, however, that for each of the preceding extensions care must be
taken with regard to timing if several queries are being pipelined. This is because the
result being sent upward by each node is no longer a single bit but rather many bits of
information from potentially several records (in the worst case the answer consists of
the n entire records). Since the answer to a query is now of unpredictable length, it is
no longer guaranteed that a query will be answered in O(log n) time, that the period is
constant, or that q queries will be processed in O(log n) + O(q) time.
2. Count Another variant of the basic query asks for the number of records
whose s field equals x. This is handled exactly as the basic query, except that now the
intermediate nodes and the root compute the sum of their inputs (instead of the logical
or). With this modification, the root in example 5.2 would produce a 2.
3. Closest Element Sometimes it may be useful to find the element of S whose
value is closest to x. As with the basic query, x is first sent to the leaves. Leaf i now
computes the absolute value of $s_i - x$, call it $a_i$, and produces $(i, a_i)$ as output.
Each intermediate node and the root now receive two pairs $(i, a_i)$ and $(j, a_j)$: The
pair with the smaller a component is sent upward. With this modification and x = 38
as input, the root in example 5.2 would produce (3,2) as output. Note that the case of
two pairs with identical a components is handled either by choosing one of the two
arbitrarily or by sending both upward consecutively.
4. Rank The rank of an element x in S is defined as the number of elements of S
smaller than x plus 1. We begin by sending x to the leaves and then having each leaf i
produce a 1 if $s_i < x$, and a 0 otherwise. Now the rank of x in S is computed by making
all intermediate nodes add their inputs and send the result upward. The root adds 1 to
the sum of its two inputs before producing the rank. With this modification, the root's
output in example 5.2 would be 3.
It should be emphasized that each of the preceding variants, if carefully timed,
should have the same running time as the basic query (except, of course, when the
queries being processed do not have constant-length answers as pointed out earlier).
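The four variants differ only in what each leaf produces and in how a node combines its two inputs, so they can be sketched with one generic bottom-up reduction. The helper names below (`tree_reduce`, `position`, `count`, `closest`, `rank`) are illustrative, the number of leaves is assumed to be a power of two, and ties in `closest` are resolved in favor of the left input, which is one of the arbitrary choices the text allows.

```python
def tree_reduce(values, combine):
    """Combine leaf outputs pairwise, level by level, up to the root."""
    level = list(values)
    while len(level) > 1:
        level = [combine(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


def position(S, x):
    leaves = [(1 if s == x else 0, i) for i, s in enumerate(S, start=1)]
    # A (1, i) pair beats a (0, j) pair; otherwise the left pair goes up.
    return tree_reduce(leaves,
                       lambda a, b: b if (a[0] == 0 and b[0] == 1) else a)


def count(S, x):
    return tree_reduce([1 if s == x else 0 for s in S], lambda a, b: a + b)


def closest(S, x):
    leaves = [(i, abs(s - x)) for i, s in enumerate(S, start=1)]
    # The pair with the smaller a component goes up (left pair on a tie).
    return tree_reduce(leaves, lambda a, b: a if a[1] <= b[1] else b)


def rank(S, x):
    # Intermediate nodes add their inputs; the root adds 1 to the total.
    return tree_reduce([1 if s < x else 0 for s in S], lambda a, b: a + b) + 1
```

On the sequence of Example 5.2 these reproduce the values quoted in the text: position of 17 is (1, 6), count of 17 is 2, the closest element to 38 is reported as (3, 2), and the rank of 17 is 3.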
A new record received by the root is inserted into an unoccupied leaf as follows:
(i) The root passes the record to the one of its two subtrees with unoccupied leaves.
If both have unoccupied leaves, the root makes an arbitrary decision; if neither
does, the root signals an overflow situation.
(ii) When an intermediate node receives the new record, it routes it to its subtree
with unoccupied leaves (again, making an arbitrary choice, if necessary).
(iii) The new record eventually reaches an unoccupied leaf where it is stored.
Note that whenever the root, or an intermediate node, sends the new record to a
subtree, the number of unoccupied leaves associated with that subtree is decreased by
1. It should be clear that insertion is greatly facilitated by the fact that the file is not to
be maintained in any particular order.
2. Update Say that every record whose s field equals x must be updated with
new information in (some of) its other fields. This is accomplished by sending x and
the new information to all leaves. Each leaf i for which $s_i = x$ implements the change.
3. Deletion If every record whose s field equals x must be deleted, then we begin
by sending x to all leaves. Each leaf i for which $s_i = x$ now declares itself as unoccupied
by sending a 1 to its parent. This information is carried upward until it reaches the
root. On its way, it increments by 1 the appropriate count in each node of the number
of unoccupied leaves in the left or right subtree.
Each of the preceding maintenance operations takes O(log n) time. As before, q
operations can be pipelined to require O(log n) + O(q) time in total.
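Insertion and deletion can be sketched with an array-based tree in which each node keeps a count of unoccupied leaves. This is an illustrative model, not the book's data layout: the class name `TreeFile` is hypothetical, each node stores a single count for its whole subtree (rather than separate left and right counts), the arbitrary routing choice is resolved by preferring the left subtree, and the number of leaves is assumed to be a power of two.

```python
class TreeFile:
    """Nodes are numbered 1..2n-1 in heap order; leaves are n..2n-1."""

    def __init__(self, n):
        self.n = n
        self.key = [None] * (2 * n)   # key[v]: s field stored in leaf v
        self.free = [0] * (2 * n)     # free[v]: unoccupied leaves below v
        for v in range(2 * n - 1, 0, -1):
            self.free[v] = (1 if v >= n
                            else self.free[2 * v] + self.free[2 * v + 1])

    def insert(self, s):
        """Route a new record to an unoccupied leaf; False signals overflow."""
        if self.free[1] == 0:
            return False
        v = 1
        while v < self.n:
            self.free[v] -= 1         # one fewer unoccupied leaf below v
            v = 2 * v if self.free[2 * v] > 0 else 2 * v + 1
        self.free[v] = 0
        self.key[v] = s
        return True

    def delete(self, x):
        """Free every leaf whose key equals x; counts are updated upward."""
        removed = 0
        for leaf in range(self.n, 2 * self.n):
            if self.key[leaf] == x:
                self.key[leaf] = None
                v = leaf
                while v >= 1:
                    self.free[v] += 1
                    v //= 2
                removed += 1
        return removed
```

Because the file is kept in no particular order, `insert` only needs the counts to find some free leaf, which is the point made about insertion in the text.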
We conclude this section with the following observations.
1. We have obtained a search algorithm for a tree-connected computer that is
more efficient than that described for a much stronger model, namely, the EREW SM
SIMD. Is there a paradox here? Not really. What our result indicates is that we
managed to find an algorithm that does not require the full power of the shared-
memory model and yet is more efficient than an existing EREW algorithm. Since any
algorithm for an interconnection network SIMD computer can be simulated on the
shared-memory model, the tree algorithm for searching can be turned into an EREW
algorithm with the same performance.
2. It may be objected that our comparison of the tree and shared-memory
algorithms is unfair since we are using 2n - 1 processors on the tree and only n on the
EREW computer. This objection can be easily taken care of by using a tree with n/2
leaves and therefore a total of n - 1 processors. Each leaf now stores two records and
performs two comparisons for every given x.
3. If a tree with N leaves is available, where 1 < N < n, then n/N records are
stored per leaf. A query now requires
that is, a total of O(log N) + O(n/N). This is identical to the time required by the
algorithms that run on the more powerful EREW, ERCW, or CREW SM SIMD
computers. Pipelining, however, is not as attractive as before: Searching within each
leaf no longer requires constant time and q queries are not guaranteed to be answered
in O(log n) + O(q) time.
4. Throughout the preceding discussion we have assumed that the wire delay,
that is, the time it takes a datum to propagate along a wire, from one level of the tree
to the next is a constant. Thus for a tree with n leaves, each query or maintenance
operation under this assumption requires a running time of O(log n) to be processed.
In addition, the time between two consecutive inputs or two consecutive outputs is
constant: In other words, searching on the tree has a constant period (provided, of
course, that the queries have constant-length answers). However, a direct hardware
implementation of the tree-connected computer would obviously have connections
between levels whose length grows exponentially with the level number. As Fig. 5.5
illustrates, the wire connecting a node at level i to its parent at level i + 1 has length
proportional to $2^i$. The maximum wire length for a tree with n leaves is O(n) and occurs
at level log n - 1. Clearly, this approach is undesirable from a practical point of view,
as it results in a very poor utilization of the area in which the processors and wires are
placed. Furthermore, it would yield a running time of O(n) per query if the
propagation time is taken to be proportional to the wire length. In order to prevent
this, we can embed the tree in a mesh, as shown in Fig. 5.7. Figure 5.7 illustrates an n-
This is a definite improvement over the previous design, but not sufficiently so to
make the tree the preferred architecture for search problems. In the next section we
describe a parallel algorithm for searching on a mesh-connected SIMD computer
whose behavior is superior to that of the tree algorithm under the linear propagation
time assumption.
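For the direct (unembedded) tree layout discussed in observation 4, the O(n) bound under linear propagation time follows from summing the wire lengths on a root-to-leaf path. The sketch below takes the wire from level i to level i + 1 to have length exactly 2^i, with leaves at level 0 and unit proportionality constant; these normalizations, and the function names, are assumptions made for illustration.

```python
def longest_wire(n):
    """Length of the wire entering the root of a tree with n leaves."""
    levels = n.bit_length() - 1       # log n, assuming n is a power of two
    return 2 ** (levels - 1) if levels else 0


def root_to_leaf_wire_length(n):
    """Total wire length crossed by a query travelling from root to leaf."""
    levels = n.bit_length() - 1
    return sum(2 ** i for i in range(levels))
```

For n = 8 the longest wire has length 4 = n/2 and the whole path has length 1 + 2 + 4 = 7 = n - 1, so a query costs time linear in n when delay is proportional to wire length.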
[Figure: mesh-connected computer with INPUT/OUTPUT at processor P(1,1).]
neighbors P(1,1) and P(1,2) send $(b_{1,1}, x)$ and $(b_{1,2}, x)$ to P(2,1) and P(2,2),
respectively. Once $b_{2,1}$ and $b_{2,2}$ have been computed, the two column neighbors
P(1,2) and P(2,2) communicate $(b_{1,2}, x)$ and $(b_{2,2}, x)$ to P(1,3) and P(2,3), respectively.
This unfolding process, which alternates row and column propagation, continues until
x reaches $P(n^{1/2}, n^{1/2})$.
Folding. At the end of the unfolding stage every processor has had a chance to
"see" x and compare it to the s field of the record it holds. In this second stage, the
reverse action takes place. The output bits are propagated from row to row and from
column to column in an alternating fashion, right to left and bottom to top, until the
answer emerges from P(1,1). The algorithm is given as procedure MESH SEARCH:
        end for
    (3.6) if (b_{i-1,i-1} = 1 or b_{i,i-1} = 1) then b_{i-1,i-1} ← 1
          else b_{i-1,i-1} ← 0
          end if
    end for.
Step 4: {P(1,1) produces the output}
    if b_{1,1} = 1 then answer ← yes
    else answer ← no
    end if.
Analysis
As each of steps 1 and 4 takes constant time and steps 2 and 3 consist of $n^{1/2} - 1$
constant-time iterations, the time to process a query is $O(n^{1/2})$. Notice that after the
first iteration of step 2, processor P(1,1) is free to receive a new query. The same
remark applies to other processors in subsequent iterations. Thus queries can be
processed in pipeline fashion. Inputs are submitted to P(1,1) at a constant rate. Since
the answer to a basic query is of fixed length, outputs are also produced by P(1,1) at a
constant rate following the answer to the first query. Hence the period is constant.
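The folding stage can be illustrated with a simplified sequential simulation. The sketch below is not procedure MESH SEARCH itself: the unfolding stage is collapsed into a single step in which every processor computes its bit, and the folding loop alternately ORs the outermost column into its left neighbor and the outermost row into the row above it, which is one way to realize the right-to-left, bottom-to-top alternation described in the text. The function name `mesh_search` and the test grid are hypothetical.

```python
def mesh_search(grid, x):
    """Answer a basic query on an m x m mesh, where m = n**(1/2).

    Returns ("yes" or "no", number of folding iterations).
    """
    m = len(grid)
    # Unfolding (collapsed): every processor sees x and computes its bit b.
    b = [[1 if grid[r][c] == x else 0 for c in range(m)] for r in range(m)]
    iterations = 0
    # Folding: shrink the active square from m x m down to P(1,1).
    for i in range(m - 1, 0, -1):     # i: 0-based index of outermost row/column
        for r in range(i + 1):        # column i is ORed into column i - 1
            b[r][i - 1] |= b[r][i]
        for c in range(i):            # row i is ORed into row i - 1
            b[i - 1][c] |= b[i][c]
        iterations += 1
    return ("yes" if b[0][0] == 1 else "no"), iterations
```

The loop runs m - 1 times, in line with the $n^{1/2} - 1$ constant-time iterations counted in the analysis.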
Example 5.3
Let a set of 16 records stored in a 4 x 4 mesh-connected SIMD computer be as shown in
Fig. 5.9. Each square in Fig. 5.9(a) represents a processor and the number inside it is the s
field of the associated record. Wires connecting the processors are omitted for simplicity.
It is required to determine whether there exists a record with s field equal to 15 (i.e.,
x = 15). Figures 5.9(b)-5.9(h) illustrate the propagation of 15 in the array. Figure 5.9(i)
shows the relevant b values at the end of step 2. Figures 5.9(j)-5.9(o) illustrate the folding
process. Finally Fig. 5.9(p) shows the result as produced in step 4. Note that in Fig. 5.9(e)
processor P(1,1) is shown empty indicating that it has done its job propagating 15 and is
now ready to receive a new query.
1. No justification was given for transmitting $b_{i,j}$ along with x during the unfolding
stage. Indeed, if only one query is to be answered, no processor needs to
communicate its b value to a neighbor: All processors can compute and retain
their outputs; these can then be combined during the folding stage. However, if
several queries are to be processed in pipeline fashion, then each processor must
first transmit its current b value before computing the next one. In this way the
$b_{i,j}$ are continually moving, and no processor needs to store its b value.
2. When several queries are being processed in pipeline fashion, the folding stage of
one query inevitably encounters the unfolding stage of another. As we did for the
tree, we assume that a processor simultaneously receiving data from opposite
directions can process them in a single time unit or that every processor
alternately switches its attention from one direction to the other.
3. It should be clear that all variations over the basic query problem described in
section 5.3.2.1 can be easily handled by minor modifications to procedure
MESH SEARCH.
5.4 PROBLEMS
5.1 Show that $\Omega(\log n)$ is a lower bound on the number of steps required to search a sorted
sequence of n elements on an EREW SM SIMD computer with n processors.
5.2 Consider the following variant of the EREW SM SIMD model. In one step, a processor
can perform an arbitrary number of computations locally or transfer an arbitrary number
of data (to or from the shared memory). Regardless of the amount of processing
(computations or data transfers) done, one step is assumed to take a constant number of
time units. Note, however, that a processor is allowed to gain access to a unique memory
location during each step (as customary for the EREW model). Let n processors be
available on this model to search a sorted sequence $S = \{s_1, s_2, \ldots, s_n\}$ of length n for a
given value x. Suppose that any subsequence of S can be encoded to fit in one memory
location. Show that under these conditions the search can be performed in $O(\log^{1/2} n)$ time.
[Hint: Imagine that the data structure used to store the sequence in shared memory is a
binary tree, as shown in Fig. 5.10(a) for n = 31. This tree can be encoded as shown in Fig.
5.10(b).]
5.3 Prove that $\Omega(\log^{1/2} n)$ is a lower bound on the number of steps required to search a sorted
sequence of n elements using n processors on the EREW SM SIMD computer of problem
5.2.
5.4 Let us reconsider problem 5.2 but without the assumption that arbitrary subsequences of
S can be encoded to fit in one memory location and communicated in one step. Instead, we
shall store the sequence in a tree with d levels such that a node at level i contains d - i
elements of S and has d - i + 1 children, as shown in Fig. 5.11 for n = 23. Each node of
this tree is assigned to a processor that has sufficient local memory to store the elements of
S contained in that node. However, a processor can read only one element of S at every
step. The key x to be searched for is initially available to the processor in charge of the
root. An additional array in memory, with as many locations as there are processors,
allows processor $P_i$ to communicate x to $P_j$ by depositing it in the location associated with
$P_j$. Show that O(n) processors can search a sequence of length n in O(log n/log log n) time.
5.5 BIBLIOGRAPHICAL REMARKS
The problem of searching a sorted sequence in parallel has attracted a good deal of attention
since searching is an often-performed and time-consuming operation in most database,
information retrieval, and office automation applications. Algorithms similar to procedure
CREW SEARCH for searching on the EREW and CREW models, as well as variations of these
models, are described in [Coraor], [Kruskal], [Munro], and [Snir]. In [Baer] a parallel
computer is described that consists of N processors connected via a switch to M memory
blocks. During each computational step several processors can gain access to several memory
blocks simultaneously, but no more than one processor can gain access to a given memory
block (recall Fig. 1.4). A sorted sequence is distributed among the memory blocks. Various
implementations of the binary search algorithm for this model are proposed in [Baer]. A brief
discussion of how to speed up information retrieval operations through parallel processing is
provided in [Salton 1].
Several algorithms for searching on a tree-connected computer are described in
[Atallah], [Bentley], [Bonuccelli], [Chung], [Leiserson 1], [Leiserson 2], [Ottman],
[Somani], and [Song]. Some of these algorithms allow for records to be stored in all nodes of
the tree, while others allow additional connections among the nodes (such as, e.g., connecting
the leaves as a linear array). The organization of a commercially available tree-connected
computer for database applications is outlined in [Seaborn]. Also, various ways to implement
tree-connected computers in VLSI are provided in [Bhatt] and [Schmeck 1]. An algorithm
analogous to procedure MESH SEARCH can be found in [Schmeck 2]. The idea that the
propagation time of a signal along a wire should be taken as a function of the length of the wire
in parallel computational models is suggested in [Chazelle] and [Thompson].
Other parallel algorithms for searching on a variety of architectures are proposed in the
literature. It is shown in [Kung 2], for example, how database operations such as intersection,
duplicate removal, union, join, and division can be performed on one- and two-dimensional
arrays of processors. Other parallel search algorithms are described in [Boral], [Carey],
[Chang], [DeWitt 1], [DeWitt 2], [Ellis 1], [Ellis 2], [Fisher], [Hillyer], [Kim], [Lehman],
[Potter], [Ramamoorthy], [Salton 21, [Schuster], [Stanfill], [Stone], [Su], [Tanaka], and
[Wong]. In [Rudolph] and [Weller] the model of computation is a so-called parallel pipelined
computer, which consists of N components of M processors each. Each component can initiate
a comparison every 1/M units of time; thus up to NM comparisons may be in progress at one
time. The algorithms in [Rudolph] and [Weller] implement a number of variations of binary
search. Several questions related to querying and maintaining files on an MIMD computer are
addressed in [Kung 1], [Kwong 1], and [Kwong 2]. Parallel hashing algorithms are presented
in [Mühlbacher]. Finally, parallel search in the continuous case is the subject of [Gal] and
[Karp].
5.6 REFERENCES
[ATALLAH]
Atallah, M. J., and Kosaraju, S. R., A generalized dictionary machine for VLSI, IEEE
Transactions on Computers, Vol. C-34, No. 2, February 1985, pp. 151-155.
[BAER]
Baer, J.-L., Du, H. C., and Ladner, R. E., Binary Search in a multiprocessing environment,
IEEE Transactions on Computers, Vol. C-32, No. 7, July 1983, pp. 667-676.
[BENTLEY]
Bentley, J. L., and Kung, H. T., Two papers on a tree-structured parallel computer, Technical
Report No. CMU-CS-79-142, Department of Computer Science, Carnegie-Mellon University,
Pittsburgh, August 1979.
[BHATT]
Bhatt, S. N., and Leiserson, C. E., How to assemble tree machines, Proceedings of the 14th
Annual ACM Symposium on Theory of Computing, San Francisco, California, May 1982,
pp. 77-84, Association for Computing Machinery, New York, N.Y., 1982.
[BONUCCELLI]
Bonuccelli, M. A., Lodi, E., Lucio, F., Maestrini, P., and Pagli, L., A VLSI tree machine for
relational data bases, Proceedings of the 10th Annual ACM International Symposium on
Computer Architecture, Stockholm, Sweden, June 1983, pp. 67-73, Association for Comput-
ing Machinery, New York, N.Y., 1983.
[BORAL]
Boral, H., and DeWitt, D. J., Database machines: An idea whose time has passed? A critique
of the future of database machines, in Leilich, H. O., and Missikoff, M., Eds., Database
Machines, Springer-Verlag, Berlin, 1983.
[CAREY]
Carey, M. J., and Thompson, C. D., An efficient implementation of search trees on
⌈log N + 1⌉ processors, IEEE Transactions on Computers, Vol. C-33, No. 11, November 1984,
pp. 1038-1041.
[CHANG]
Chang, S.-K., Parallel balancing of binary search trees, IEEE Transactions on Computers, Vol.
C-23, No. 4, April 1974, pp. 441-445.
[CHAZELLE]
Chazelle, B., and Monier, L., A model of computation for VLSI with related complexity
results, Journal of the ACM, Vol. 32, No. 3, July 1985, pp. 573-588.
[CHUNG]
Chung, K. M., Lucio, F., and Wong, C. K., Magnetic bubble memory structures for efficient
sorting and searching, in Lavington, S. H., Ed., Information Processing 80, North-Holland,
Amsterdam, 1980.
[CORAOR]
Coraor, L. D., A multiprocessor organization for large data list searches, Ph.D. thesis,
Department of Electrical Engineering, University of Iowa, Iowa City, July 1978.
[DEWITT 1]
DeWitt, D. J., DIRECT-a multiprocessor organization for supporting relational database
management systems, IEEE Transactions on Computers, Vol. C-28, No. 6, June 1979, pp. 395-
406.
[DEWITT 2]
DeWitt, D. J., and Hawthorn, P. B., A performance evaluation of database machine
architectures, Proceedings of the 7th International Conference on Very Large Data Bases,
Cannes, France, September 1981, pp. 199-213, VLDB Endowment, Cannes, France, 1981.
[ELLIS 1]
Ellis, C., Concurrent search and insertion in 2-3 trees, Acta Informatica, Vol. 14, 1980, pp. 63-
86.
[ELLIS 2]
Ellis, C., Concurrent search and insertion in AVL trees, IEEE Transactions on Computers,
Vol. C-29, No. 9, September 1980, pp. 811-817.
[FISHER]
Fisher, A. L., Dictionary machines with a small number of processors, Proceedings of the
[MUNRO]
Munro, J. I., and Robertson, E. L., Parallel algorithms and serial data structures, Proceedings
of the 17th Annual Allerton Conference on Communications, Control and Computing,
Monticello, Illinois, October 1979, pp. 21-26, University of Illinois, Urbana-Champaign,
Illinois, 1979.
[OTTMAN]
Ottman, T. A., Rosenberg, A. L., and Stockmeyer, L. J., A dictionary machine (for VLSI),
IEEE Transactions on Computers, Vol. C-31, No. 9, September 1982, pp. 892-897.
[POTTER]
Potter, J. L., Programming the MPP, in Potter, J. L., Ed., The Massively Parallel Processor,
MIT Press, Cambridge, Mass., 1985, pp. 218-229.
[RAMAMOORTHY]
Ramamoorthy, C. V., Turner, J. L., and Wah, B. W., A design of a fast cellular associative
memory for ordered retrieval, IEEE Transactions on Computers, Vol. C-27, No. 9, September
1978, pp. 800-815.
[RUDOLPH]
Rudolph, D., and Schlosser, K.-H., Optimal searching algorithms for parallel pipelined
computers, in Feilmeier, M., Joubert, J., and Schendel, U., Eds., Parallel Computing 83,
North-Holland, Amsterdam, 1984.
[SALTON 1]
Salton, G., Automatic information retrieval, Computer, Vol. 13, No. 9, September 1980, pp.
41-56.
[SALTON 2]
Salton, G., and Buckley, C., Parallel text search methods, Communications of the ACM, Vol.
31, No. 2, February 1988, pp. 202-215.
[SCHMECK 1]
Schmeck, H., On the maximum edge length in VLSI layouts of complete binary trees,
Information Processing Letters, Vol. 23, No. 1, July 1986, pp. 19-23.
[SCHMECK 2]
Schmeck, H., and Schröder, H., Dictionary machines for different models of VLSI, IEEE
Transactions on Computers, Vol. C-34, No. 2, February 1985, pp. 151-155.
[SCHUSTER]
Schuster, S. A., Nguyen, H. B., and Ozkarahan, E. A., RAP.2: An associative processor for
databases and its applications, IEEE Transactions on Computers, Vol. C-28, No. 6, June 1979,
pp. 446-458.
[SEABORN]
Seaborn, T., The genesis of a database computer, Computer, Vol. 17, No. 11, November 1984,
pp. 42-56.
[SNIR]
Snir, M., On parallel searching, SIAM Journal on Computing, Vol. 14, No. 3, August 1985, pp.
688-708.
[SOMANI]
Somani, A. K., and Agarwal, V. K., An efficient VLSI dictionary machine, Proceedings of the
11th Annual ACM International Symposium on Computer Architecture, Ann Arbor,
Michigan, June 1984, pp. 142-150, Association for Computing Machinery, New York, N.Y.,
1984.
[SONG]
Song, S. W., A highly concurrent tree machine for database applications, Proceedings of the
1980 International Conference on Parallel Processing, Harbor Springs, Michigan, August
1980, pp. 259-268, IEEE Computer Society, Washington, D.C., 1980.
[STANFILL]
Stanfill, C., and Kahle, B., Parallel free text search on the connection machine system,
Communications of the ACM, Vol. 29, No. 12, December 1986, pp. 1229-1239.
[STONE]
Stone, H. S., Parallel querying of large databases: A case study, Computer, Vol. 20, No. 10,
October 1987, pp. 11-21.
[SU]
Su, S. Y. W., Associative programming in CASSM and its applications, Proceedings of the
3rd International Conference on Very Large Data Bases, Tokyo, Japan, October 1977, pp.
213-228, VLDB Endowment, Tokyo, Japan, 1977.
[TANAKA]
Tanaka, Y., Nozaka, Y., and Masuyama, A., Pipeline searching and sorting modules as
components of a data flow database computer, in Lavington, S. H., Ed., Information
Processing 80, North-Holland, Amsterdam, 1980.
[THOMPSON]
Thompson, C. D., The VLSI complexity of sorting, IEEE Transactions on Computers,
Vol. C-32, No. 12, December 1983, pp. 1171-1184.
[WELLER]
Weller, D. L., and Davidson, E. S., Optimal searching algorithms for parallel-pipelined
computers, in Goos, G., and Hartmanis, J., Eds., Parallel Processing, Springer-Verlag, Berlin,
1975, pp. 291-305.
[WONG]
Wong, C. K., and Chang, S.-K., Parallel generation of binary search trees, IEEE Transactions
on Computers, Vol. C-23, No. 3, March 1974, pp. 268-271.