1 Parallel and Distributed Computation
2 PRAM Model
A PRAM machine consists of p synchronous processors with shared memory.
This model ignores synchronization problems and communication issues, and
concentrates on the task of parallelizing the problem. One gets several
variants of this model depending on how processors are permitted
to access the same memory location at the same time.
• ER = Exclusive Read: only one processor can read a location in any
one step
• EW = Exclusive Write: only one processor can write a location in any
one step
The “right” model is probably an EREW PRAM, but we will study other
models as academic exercises. We will sometimes refer to algorithms by the
type of model that these algorithms are designed for, e.g. an EREW PRAM
algorithm.
We define T(n, p) to be the parallel time of the algorithm under consideration
with p processors on an input of size n. Let S(n) be the sequential time
complexity. Then the efficiency of a parallel algorithm is defined by
    E(n, p) = S(n) / (p T(n, p))
Efficiencies can range from 0 to 1. The best possible efficiency you can have
is 1. Generally we prefer algorithms whose efficiency is Ω(1).
The folding principle can be stated in two equivalent ways:
3 The OR Problem
INPUT: Bits b_1, ..., b_n
OUTPUT: The logical OR of the bits
One can obtain an EREW algorithm with T(n, p) = n/p + log p using
a divide and conquer algorithm that is perhaps best understood as a binary
tree. Each of the p leaves of the binary tree holds n/p bits, which one
processor ORs sequentially in n/p steps. Each internal node is a
processor that ORs the outputs of its children. The efficiency of the EREW
algorithm is
    E(n, p) = S(n) / (p T(n, p)) = n / (p (n/p + log p)) = n / (n + p log p)
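To make the n/p + log p bound concrete, here is a minimal Python sketch that simulates the tree algorithm sequentially and counts parallel steps. The function names `parallel_or` and `efficiency` are illustrative, and the sketch assumes S(n) = n and rounds n/p up when p does not divide n.

```python
import math


def parallel_or(bits, p):
    """Simulate the tree-based OR with p processors.

    Returns (result, steps), where steps counts simulated parallel time.
    """
    if p < 1:
        raise ValueError("need at least one processor")
    n = len(bits)
    chunk = max(1, math.ceil(n / p))
    # Phase 1: processor k ORs its own block of about n/p bits sequentially.
    partial = [any(bits[k * chunk:(k + 1) * chunk]) for k in range(p)]
    steps = chunk
    # Phase 2: combine the p partial results up a binary tree,
    # one tree level per parallel step.
    while len(partial) > 1:
        partial = [any(partial[k:k + 2]) for k in range(0, len(partial), 2)]
        steps += 1
    return int(partial[0]), steps


def efficiency(n, p):
    """E(n, p) = S(n) / (p * T(n, p)), assuming S(n) = n."""
    _, steps = parallel_or([0] * n, p)
    return n / (p * steps)
```

For example, with n = 8 and p = 4 the simulation takes 2 + 2 = 4 steps, so the efficiency is 8/16 = 0.5.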
4 MIN Problem
See section 10.2.1, and section 10.2.2.
INPUT: Integers x_1, ..., x_n
OUTPUT: The smallest x_i
The results are essentially the same as for the OR problem. There is an
EREW divide and conquer algorithm with E(n, n/log n) = Θ(1). Note that
this technique works for any associative operator (both OR and MIN are
associative).
There is a CRCW Common algorithm with T(n, p = n^2) = 1 and
E(n, p = n^2) = 1/n. Here's code for processor P_{i,j}, 1 ≤ i, j ≤ n, for the
CRCW Common algorithm to compute the location of the minimum number:
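As a rough illustration, the following Python sketch simulates one standard CRCW Common scheme for this task sequentially. It is an assumed reconstruction and not necessarily the exact listing intended here: the array name `loser`, the 0-based indices, and the explicit check for conflicting writes are choices made for this sketch.

```python
def crcw_common_argmin(x):
    """Simulate n^2 processors P(i, j) finding the index of the minimum.

    Raises ValueError if processors would write different values to the
    answer cell in the same step, which the Common model does not allow.
    """
    n = len(x)
    loser = [False] * n
    # Step 1: P(i, j) compares x[i] with x[j]. Every processor that writes
    # loser[i] writes the same value (True), so the writes are legal in
    # the Common model.
    for i in range(n):
        for j in range(n):
            if x[i] > x[j]:
                loser[i] = True
    # Step 2: each index i that never lost tries to write i to the answer.
    attempted = {i for i in range(n) if not loser[i]}
    if len(attempted) != 1:
        raise ValueError("conflicting concurrent writes: %s" % sorted(attempted))
    return attempted.pop()
```

With distinct inputs exactly one index survives step 1, so the final write is unambiguous.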
What happens when the above code is run in the various CW models if
there are two smallest numbers?
What happens in the various CW models if there are two smallest numbers
and you just want to compute the value of the smallest number, that is
if the last line is changed to
and
S_{2i-1} = (x_1 + x_3 + ... + x_{2i-1}) + (x_2 + x_4 + ... + x_{2i-2})
This gives an algorithm with time T(n, n) = log n on an EREW PRAM. This
can be improved to T(n, n/log n) = log n, thus giving Θ(1) efficiency.
Note that dividing into the first half and the last half is more
difficult, because the sum for the first half becomes a bottleneck that all of
the processors for the last half want to access. Also note that this technique
works for any associative operator.
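The odd/even decomposition in the equation for S_{2i-1} can be sketched in Python as follows. This is a sequential simulation under assumptions: the function name `prefix_sums` is illustrative, the operator is ordinary addition, and the matching rule for even positions, S_{2i} = (x_1 + x_3 + ... + x_{2i-1}) + (x_2 + x_4 + ... + x_{2i}), is inferred from the odd case. The two recursive calls are independent and the combining loop is a parallel for loop, so the depth satisfies T(n) = T(n/2) + O(1).

```python
def prefix_sums(x):
    """Return [x[0], x[0]+x[1], ...] using the odd/even split."""
    n = len(x)
    if n <= 1:
        return list(x)
    # In the text's 1-based notation, x[0::2] holds x_1, x_3, ... and
    # x[1::2] holds x_2, x_4, ...; the two calls could run in parallel.
    odd = prefix_sums(x[0::2])
    even = prefix_sums(x[1::2])
    s = [0] * n
    for k in range(n):  # a parallel for loop on a PRAM
        i = k // 2
        if k % 2 == 0:
            # Position 2i+1 (1-based): first i+1 odd terms plus first i even terms.
            s[k] = odd[i] + (even[i - 1] if i > 0 else 0)
        else:
            # Position 2i+2 (1-based): first i+1 odd terms plus first i+1 even terms.
            s[k] = odd[i] + even[i]
    return s
```

Because this sketch regroups the terms, as written it relies on addition being commutative as well as associative.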
For i = 1 to n do
    For j = 1 to n do
        D[i,j] = weight of edge (i, j)
The correctness of this procedure can be seen using the following loop
invariant: After t times through the repeat loop, for all i and j, D[i, j]
is equal to the length of the shortest path between vertices i and j that
has 2^t or fewer edges.
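A minimal Python sketch of a procedure with this invariant is given below. It assumes that each pass of the repeat loop performs a min-plus "squaring" of D, that D[i][i] = 0, and that missing edges have weight infinity. The function name is illustrative, and all loops run sequentially here.

```python
import math


def all_pairs_shortest_paths(w):
    """w[i][j] is the weight of edge (i, j), math.inf if absent, 0 if i == j."""
    n = len(w)
    if n == 0:
        return []
    d = [row[:] for row in w]
    # A shortest path has at most n - 1 edges, so ceil(log2 n) passes suffice.
    for _ in range(math.ceil(math.log2(n))):
        # Loops over i and j are independent of each other; the inner min
        # over k is the reduction computed by the innermost loop.
        d = [[min(d[i][k] + d[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return d
```

Building a fresh matrix on each pass keeps the invariant exact: after t passes, entry (i, j) accounts for precisely the paths with at most 2^t edges.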
A “parallel for loop” is a loop where all operations can be executed in
parallel, for example:
Question: So which loops can be replaced by parallel for loops? Answer: The
second and third. This gives T(n, p = n^2) = n log n on a CREW PRAM.
The fourth loop could be replaced by a parallel for loop on a concurrent
write machine that always writes the smallest value. But note
that the fourth loop is just computing a minimum, which is an associative
operator. Thus, using the standard algorithm to compute the value of
an associative operator in time log n with n/log n processors, each of the
log n passes through the repeat loop takes time log n, and we get time
T(n, n^3/log n) = log^2 n on a CREW PRAM.
Question: What is the efficiency? Answer: It depends what you use for
S(n):
• S(n) = T(n, 1). This measures the speed-up of the parallel algorithm, but
doesn’t give the speed-up over the standard sequential algorithm.
• S(n) = best achievable sequential time. But for almost all problems,
the best achievable sequential time is not known.
Note that there are simple sequential shortest path algorithms that run in
time O(n^3), and complicated ones that run in time something like O(n^2.4).
7 Odd-Even Merging
See section 10.2.1.
INPUT: Sorted lists x_1, ..., x_n and y_1, ..., y_n.
OUTPUT: The two lists merged into one sorted list z_1, ..., z_{2n}
We give the following divide and conquer algorithm:
Merge(x_1, x_3, x_5, ..., y_2, y_4, y_6, ...) to get a_1 ... a_n
Merge(x_2, x_4, x_6, ..., y_1, y_3, y_5, ...) to get b_1 ... b_n
for i=1 to n do
    z_{2i-1} = min(a_i, b_i)
    z_{2i} = max(a_i, b_i)
We now argue the correctness of the algorithm. Each a_i is greater than or
equal to a_1, ..., a_i. Each a_i, i > 1, is greater than or equal to b_{i-1},
and so to b_1, ..., b_{i-1}. Hence a_i is at least as large as 2i-1 of the
elements, that is, a_i ≥ z_{2i-1}. Each b_i is greater than or equal to
b_1, ..., b_i. Each b_i, i > 1, is greater than or equal to a_{i-1}. Hence,
b_i ≥ z_{2i-1}. This same argument shows a_{i+1} ≥ z_{2i+1} and
b_{i+1} ≥ z_{2i+1}. So z_{2i-1} and z_{2i} must be a_i and b_i.
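The recursion above can be sketched in Python as follows. This is a sequential simulation under assumptions: n is taken to be a power of two so that both recursive calls receive lists of equal length, the base case n = 1 is a single comparison, and the name `odd_even_merge` is illustrative.

```python
def odd_even_merge(x, y):
    """Merge two sorted lists of equal length n, n a power of two."""
    n = len(x)
    if len(y) != n or n < 1 or n & (n - 1) != 0:
        raise ValueError("expected two lists of equal power-of-two length")
    if n == 1:
        return [min(x[0], y[0]), max(x[0], y[0])]
    # x[0::2] is x_1, x_3, ... and y[1::2] is y_2, y_4, ... in the text's
    # 1-based notation; the two recursive merges could run in parallel.
    a = odd_even_merge(x[0::2], y[1::2])
    b = odd_even_merge(x[1::2], y[0::2])
    z = []
    for a_i, b_i in zip(a, b):  # a parallel for loop on a PRAM
        z.append(min(a_i, b_i))
        z.append(max(a_i, b_i))
    return z
```

Each level of the recursion adds one parallel compare-exchange step, so with enough processors the depth is logarithmic in n.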
d[n]=0
if next[i] != nil then
    d[i]=d[i] + d[next[i]]
    next[i]=next[next[i]]
The correctness of the code follows from the following loop invariant: The
position of i equals d[i] + d[next[i]] + d[next[next[i]]] + ...
Note that this is essentially solving the parallel prefix problem, with the
work done before the recursion instead of after. To solve the parallel prefix
problem we would replace the initialization
for i= 1 to n-1 in parallel do
d[i]=1
by
for i= 1 to n-1 in parallel do
d[i]=x[i]
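A minimal Python sketch of the pointer doubling loop is shown below, with the initial d values passed in so that either initialization can be used. It is a sequential simulation under assumptions: nodes are 0-indexed, `None` plays the role of nil, and the function name `pointer_doubling` and the round count are choices made for this sketch.

```python
def pointer_doubling(next_ptr, d):
    """next_ptr[i] is the successor of node i (None for the last node).

    d holds the initial values; the result is d[i] + d[next[i]] + ...
    summed along the list from i to the end.
    """
    n = len(next_ptr)
    nxt = list(next_ptr)
    d = list(d)
    # bit_length gives at least log2(n) rounds; extra rounds change nothing.
    for _ in range(max(1, n).bit_length()):
        # Synchronous step: all reads use the old arrays, then all writes land.
        new_d, new_nxt = d[:], nxt[:]
        for i in range(n):  # a parallel for loop on a PRAM
            if nxt[i] is not None:
                new_d[i] = d[i] + d[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
        d, nxt = new_d, new_nxt
    return d
```

With d set to 1 everywhere except 0 at the last node, the result is each node's distance to the end of the list; with d[i] = x[i] it is the sum of the x values from node i onward.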
A
/ \
B C
/ \
D E
11 Expression Evaluation
The input is an algebraic expression in the form of a binary tree, with the
leaves being the elements, and the internal nodes being the algebraic
operations. The goal is to compute the value of the expression. Some obvious
approaches that won’t work are: 1) evaluating nodes when the values of both
children are known, and 2) parallel prefix. The first approach won’t give you
a speed-up if the tree is unbalanced. The second approach won’t work if the
operators are not associative.
First assume that the only operation is subtraction. We label edges by
functions. We now define the cut operation. If we have a subtree that looks
like
          | h(x)
          -
         / \
   f(x) /   \ g(x)
       -     constant c
      / \
     A   B
and cut on the root of this subtree we get
     | h(f(x) - g(c))
     -
    / \
   /   \
  A     B
Similarly, if we have a subtree that looks like
              | h(x)
              -
             / \
       f(x) /   \ g(x)
  constant c     -
                / \
               A   B
and cut on the root of this subtree we get
     | h(f(c) - g(x))
     -
    / \
   /   \
  A     B
Thus we are left with finding a class of functions, with the base elements being
constants, that is closed under composition, subtraction of constants, and
subtraction from constants. This class is the functions of the form ax + b,
where a is +1 or −1 and b can be any number.
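To see the closure property concretely, the sketch below represents an edge label ax + b as a pair (a, b) and computes the new label produced by each kind of cut. The class name `Linear` and the two function names are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linear:
    """The function x -> a*x + b with a in {+1, -1}."""
    a: int
    b: int

    def __call__(self, x):
        return self.a * x + self.b


def cut_right_constant(h, f, g, c):
    """Label for x -> h(f(x) - g(c)): the right child is the constant c."""
    return Linear(h.a * f.a, h.a * (f.b - g(c)) + h.b)


def cut_left_constant(h, f, g, c):
    """Label for x -> h(f(c) - g(x)): the left child is the constant c."""
    return Linear(-h.a * g.a, h.a * (f(c) - g.b) + h.b)
```

In both cases the new coefficient of x is a product of factors that are each +1 or −1, so it stays in {+1, −1}, which is why a single pair (a, b) per edge suffices.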
Note that in one step we can apply cuts to all nodes with an odd-numbered
left child that is a leaf. Likewise, in one step we can apply cuts to all nodes
with an odd-numbered right child that is a leaf. This leads to the following
algorithm:
Repeat log n times
    For each internal node v in parallel
        if v has odd numbered left child that is a leaf then cut at v
        if v has odd numbered right child that is a leaf then cut at v
    renumber the leaves using pointer doubling
Note that in log n steps we will be down to a constant sized tree, since each
iteration of the outer loop reduces the number of leaves by one half. So
T(n, n) = log^2 n, since numbering the left or right leaves can be done in
log n time using the Eulerian tour technique.
INPUT: A Boolean circuit F consisting of AND, OR, and NOT gates,
and an assignment of 0/1 values to the input lines of the circuit.
OUTPUT: 1 if the circuit evaluates to true, and 0 otherwise.
More precisely, no one knows of a parallel algorithm that runs in time
O(log^k n) for some k, with a polynomial number of processors. Here n is
the size of the circuit. Further, this problem is complete for polynomial
time sequential algorithms in the sense that if this problem is parallelizable
(time O(log^k n) for some k, with a polynomial number of processors) then all
problems that have polynomial time sequential algorithms are parallelizable.
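For contrast, the sequential side of this problem is straightforward. The sketch below evaluates a circuit in one pass, assuming a hypothetical encoding in which gates are listed in topological order as tuples and refer to earlier gates by index; neither the encoding nor the function name comes from the text.

```python
def evaluate_circuit(gates):
    """Evaluate a circuit given as a topologically ordered list of gates.

    Each gate is ("IN", bit), ("NOT", i), ("AND", i, j) or ("OR", i, j),
    where i and j index earlier gates. The last gate is the output.
    """
    val = []
    for gate in gates:
        op = gate[0]
        if op == "IN":
            val.append(bool(gate[1]))
        elif op == "NOT":
            val.append(not val[gate[1]])
        elif op == "AND":
            val.append(val[gate[1]] and val[gate[2]])
        elif op == "OR":
            val.append(val[gate[1]] or val[gate[2]])
        else:
            raise ValueError("unknown gate type: %r" % (op,))
    return int(val[-1])
```

Each gate is processed once, so the running time is linear in the size of the circuit. The difficulty described above is that a gate's value may depend on a long chain of earlier gates, and no known method shortens such chains to polylogarithmic depth in general.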