Lec. 4: Parallel Computation
In this lecture we address complexity questions concerning parallel computation. In
sequential computation we work under the assumption that we have a single processor, whereas in
parallel computation the number of processors is large. The question now is: ‘Can we compute
faster in parallel than sequentially?’
4.1 Parallelization
In complexity theory we associate parallel computation with low depth circuits: all the gates
at depth i (measured from the leaves) can be evaluated at time step i. We will now formally define
boolean circuits.
Definition 4.1 (Boolean Circuits). A boolean circuit is a directed acyclic graph whose leaf nodes
are labelled with inputs (x_1, x_2, ..., x_n) and whose intermediate nodes are labelled with AND(∧), OR(∨)
and NOT(¬) gates. The root (having only incoming edges) of the boolean circuit computes a boolean
function of the inputs.
Two important notions that we associate with a circuit are its size and depth. The size of a
circuit is the number of edges in the circuit and the depth of a circuit is the length of the longest
path from the root to a leaf in the circuit.
An important difference between the notions of Turing machines and boolean circuits is that in the
case of Turing machines the inputs can be of arbitrary length, whereas in circuits the input length is
fixed. Hence we associate a family of circuits {C_1, C_2, ..., C_n, ...} with a problem, where C_i denotes
the circuit for input length i. It is known (Cook-Levin reduction) how to convert an algorithm for a
fixed input size to a boolean circuit with the same input size, and so associating a family of circuits
with a problem/algorithm is very natural.
Suppose we do the computation sequentially. Then it is easy to see that the total time taken
by a processor to mirror the circuit computation is at least the size of the circuit. In other words,
the size of the circuit is the total work done by the algorithm, which is equal to the time taken by
the sequential algorithm. The depth of the circuit corresponds to the parallel time complexity of
the algorithm. Since all the gates at depth i (from the leaves) are processed at time step i, the
root is processed after d time units, where d is the depth of the circuit.
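The correspondence between size and sequential work, and between depth and parallel time, can be illustrated with a small sketch. The dictionary representation and the gate names (g1, g2, root) below are assumptions made purely for illustration; they are not notation from the lecture.

```python
# Hypothetical circuit for (x1 AND x2) OR (NOT x3).
# Each gate maps to (operation, list of input names); the names are illustrative.
CIRCUIT = {
    "g1": ("AND", ["x1", "x2"]),
    "g2": ("NOT", ["x3"]),
    "root": ("OR", ["g1", "g2"]),
}

def size(circuit):
    """Size as defined above: the number of edges (wires) in the circuit."""
    return sum(len(inputs) for _, inputs in circuit.values())

def depth(circuit, node):
    """Length of the longest path from `node` down to an input leaf."""
    if node not in circuit:              # an input variable is a leaf
        return 0
    _, inputs = circuit[node]
    return 1 + max(depth(circuit, v) for v in inputs)

def evaluate_in_layers(circuit, assignment, root):
    """Evaluate level by level; gates on the same level do not depend on each other."""
    values = dict(assignment)
    layers = {}
    for gate in circuit:
        layers.setdefault(depth(circuit, gate), []).append(gate)
    for level in sorted(layers):         # one parallel time step per level
        for gate in layers[level]:       # these gates could all fire simultaneously
            op, inputs = circuit[gate]
            bits = [values[v] for v in inputs]
            if op == "AND":
                values[gate] = int(all(bits))
            elif op == "OR":
                values[gate] = int(any(bits))
            else:                        # NOT has exactly one input
                values[gate] = 1 - bits[0]
    return values[root]
```

For this toy circuit, size(CIRCUIT) counts 5 wires and depth(CIRCUIT, "root") is 2, so a sequential processor touches every wire while a parallel evaluation needs only two time steps.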
An important question to ask at this juncture is whether all computations can be parallelized.
It can be seen that any boolean function can be trivially parallelized using an exponential sized circuit
of depth 2 (exponential in terms of the number of inputs), ignoring negations applied directly to the
inputs. This can be done by writing the boolean function in DNF form: the resulting circuit has an OR
gate at the top (with possibly exponential fan-in) and AND gates at layer 2 (from the root). We will
explain this with an example.
Let f (x1 , x2 , x3 ) be a boolean function with the following truth table.
x_1  x_2  x_3  |  f(x_1, x_2, x_3)
 0    0    0   |  1
 0    0    1   |  0
 0    1    0   |  1
 0    1    1   |  0
 1    0    0   |  1
 1    0    1   |  1
 1    1    0   |  0
 1    1    1   |  1
It is easy to infer from the above example that f(x_1, x_2, x_3) has a depth 2 circuit with an OR gate at
the root and AND gates at level 2 (from the root), one AND gate for each row of the truth table on which f is 1:
f = (¬x_1 ∧ ¬x_2 ∧ ¬x_3) ∨ (¬x_1 ∧ x_2 ∧ ¬x_3) ∨ (x_1 ∧ ¬x_2 ∧ ¬x_3) ∨ (x_1 ∧ ¬x_2 ∧ x_3) ∨ (x_1 ∧ x_2 ∧ x_3).
So we need to modify the question and ask whether all computations can be efficiently parallelized.
We don’t know the answer to this question. It is conjectured that not all computations can be
efficiently parallelized. Note that here ‘efficiently’ means that the size of the circuit is polynomial
in the size of the input.
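The depth 2 construction can be sketched in code as follows. This is a minimal illustration under the assumption that the truth table rows are listed in the order 000, 001, ..., 111, as in the example; the helper names (dnf_terms, eval_dnf, show) are hypothetical.

```python
from itertools import product

# Truth table of f from the example: outputs for rows 000, 001, ..., 111.
F_VALUES = [1, 0, 1, 0, 1, 1, 0, 1]

def dnf_terms(values, n):
    """Return one AND term per row where the function is 1; a term is a tuple of bits."""
    rows = list(product([0, 1], repeat=n))
    return [row for row, out in zip(rows, values) if out == 1]

def eval_dnf(terms, x):
    """Depth 2 evaluation: a single OR gate over AND terms of literals."""
    # A term (b_1, ..., b_n) uses the literal x_j if b_j == 1 and NOT x_j otherwise,
    # so the AND of its literals is 1 exactly when the input x equals the term.
    return int(any(all(xj == bj for xj, bj in zip(x, term)) for term in terms))

def show(terms):
    """Render the DNF as text, writing ~ for negation."""
    def literal(j, b):
        return f"x{j}" if b else f"~x{j}"
    return " OR ".join(
        "(" + " AND ".join(literal(j, b) for j, b in enumerate(t, start=1)) + ")"
        for t in terms
    )

TERMS = dnf_terms(F_VALUES, 3)
```

In the worst case the number of AND terms is 2^n, which is why this trivial parallelization is exponential in size even though its depth is constant.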
When we perform addition of two n bit integers a and b, it is immediately clear that the ith bit o_i of
the output o depends on all the lower-order input bits a_{i-1}, ..., a_1 and b_{i-1}, ..., b_1. This is
because the carry c_i into the ith position depends on the carries in the previous positions. Hence it is
not immediately clear whether we can do this in constant time in parallel. If we can somehow compute
all the carries in parallel in constant time, then we can perform addition in constant time in parallel.
So our objective is to compute the carry bits in parallel.
Although the carry bit at the ith position depends on the input bits at the previous positions,
we know exactly how it depends on them. Suppose we need to compute the carry bit c_i. Then c_i is 1 if
and only if either both a_{i-1} and b_{i-1} are 1, or exactly one of a_{i-1}, b_{i-1} is 1 and c_{i-1} is 1.
We can similarly write c_{i-1} in terms of a_{i-2}, b_{i-2} and c_{i-2}. Eventually we can write c_i just in
terms of a_{i-1}, a_{i-2}, ..., a_1 and b_{i-1}, b_{i-2}, ..., b_1. Thus
c_i = (a_{i-1} ∧ b_{i-1}) ∨ ((a_{i-1} ⊕ b_{i-1}) ∧ (a_{i-2} ∧ b_{i-2})) ∨ ((a_{i-1} ⊕ b_{i-1}) ∧ (a_{i-2} ⊕ b_{i-2}) ∧ (a_{i-3} ∧ b_{i-3})) ∨ ...
Figure 1: Circuit for a carry bit
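The carry formula can be checked with a short sketch. It assumes bits are stored least significant first (so a[0] holds a_1) and that there is no carry into position 1; the function names are hypothetical. Each carry is computed directly from the inputs, never from another carry, which is what makes the constant depth evaluation possible.

```python
def carry(a, b, i):
    """Carry into position i (1-indexed, position 1 is least significant)."""
    generate = [x & y for x, y in zip(a, b)]     # a_j AND b_j
    propagate = [x ^ y for x, y in zip(a, b)]    # a_j XOR b_j
    # OR over positions j < i of: a carry is generated at j and
    # propagated through every position strictly between j and i.
    return int(any(
        generate[j] and all(propagate[k] for k in range(j + 1, i - 1))
        for j in range(i - 1)                    # j is the 0-indexed position
    ))

def add(a, b):
    """Add two n-bit numbers given as bit lists; return n + 1 output bits."""
    n = len(a)
    carries = [carry(a, b, i) for i in range(1, n + 2)]   # c_1 .. c_{n+1}
    out = [a[i] ^ b[i] ^ carries[i] for i in range(n)]    # o_i = a_i XOR b_i XOR c_i
    out.append(carries[n])                                # final carry is the top bit
    return out

def to_bits(x, n):
    """Least significant bit first."""
    return [(x >> k) & 1 for k in range(n)]

def from_bits(bits):
    return sum(bit << k for k, bit in enumerate(bits))
```

For example, from_bits(add(to_bits(11, 4), to_bits(6, 4))) recovers 17, with every carry obtained independently of the others.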
Theorem 4.1. For n bit integer addition there is a circuit of depth 4 and size O(n^3) with ∨ and
∧ gates having fan-in O(n) and ⊕ gates having fan-in at most 3.
Proof. It is clear from the above that c_i can be computed by a circuit of depth 3, as shown in Figure 1:
a layer of ⊕ gates, a layer of ∧ gates, and a single ∨ gate at the top.
Once we have c_i, o_i = a_i ⊕ b_i ⊕ c_i, which can be computed by an XOR gate with fan-in 3, for a total
depth of 4. We will now compute the size of the required circuit using recursion. A quick observation
shows that the fan-in of both ∨ and ∧ gates is bounded by O(n). This immediately gives us the recursion
S(c_i) = O(n) + S(c_{i-1}), where S(c_i) represents the size of the circuit computing the ith bit of the
carry. Thus an O(n^2) sized circuit suffices to compute one bit of the carry. Since there are n carry
bits, the total size of the circuit is bounded by O(n^3).
If we wish for AND, OR and XOR gates to have fan-in bounded by 2, we get an O(log n)
depth circuit of the same size, O(n^3), by replacing every gate with fan-in more than 2 by a complete
binary tree of fan-in 2 gates. This gives us the following corollary.
Corollary 4.2. For n bit integer addition there is a circuit of depth O(log n) and size O(n^3) with
all gates having fan-in at most 2.
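The replacement of a high fan-in gate by a binary tree can be sketched as below. This is an illustrative sketch, assuming a non-empty list of input bits; tree_reduce is a hypothetical helper that returns both the value and the number of fan-in 2 layers used.

```python
from operator import and_, or_, xor

def tree_reduce(op, bits):
    """Combine bits with a fan-in 2 gate `op`; return (value, depth of the tree)."""
    layer, depth = list(bits), 0
    while len(layer) > 1:
        # Pair up neighbouring wires; each pair feeds one fan-in 2 gate.
        nxt = [op(layer[k], layer[k + 1]) for k in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:               # an unpaired wire passes through unchanged
            nxt.append(layer[-1])
        layer, depth = nxt, depth + 1
    return layer[0], depth
```

A gate with fan-in k becomes a tree of depth ceil(log2 k), so each of the constantly many layers of fan-in O(n) gates contributes O(log n) to the depth.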
The next obvious question along this line of thought is whether we can extend this to addition
of n integers each of n bits, i.e., whether we can perform iterated addition using constant depth
circuits of polynomial size. It is known that this cannot be done, due to the following theorem.
Theorem 4.3 ([AKS83],[FSS84]). Any depth 4 circuit for computing the parity of n bits using
AND, OR and NOT gates (of unbounded fan-in) must have size 2^{Ω(n^{1/4})}.
This bound was later made optimal by Håstad. Using his switching lemma [Hås86], Håstad showed
that any constant depth circuit computing the parity function requires exponential size. The least
significant bit in the sum of n integers each of n bits is the parity of the least significant bits of
the inputs. Thus it follows immediately from Theorem 4.3 that iterated addition cannot
be performed by polynomial sized and constant depth circuits. In the next class we will see that
integer multiplication, iterated integer addition and iterated integer multiplication belong to a class
of circuits called TC^0.
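The reduction from parity to iterated addition used in the argument above can be checked directly. The sketch below is illustrative only, with hypothetical function names: it treats each input bit as an integer, sums the integers, and reads off the least significant bit of the sum.

```python
from functools import reduce
from operator import xor

def parity(bits):
    """XOR of all the bits."""
    return reduce(xor, bits, 0)

def lsb_of_sum(numbers):
    """Least significant bit of the sum of the given integers."""
    return sum(numbers) & 1

def parity_via_iterated_addition(bits):
    """Compute parity with a single iterated addition.

    Each bit is treated as an integer whose least significant bit is that bit,
    so any circuit for iterated addition also yields a circuit for parity.
    """
    return lsb_of_sum(bits)
```

Since a small constant depth circuit for iterated addition would give one for parity in this way, the lower bound for parity transfers to iterated addition.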
References
[AKS83] Miklós Ajtai, János Komlós, and Endre Szemerédi. An O(n log n) sorting network. In Proc. 15th Annual ACM
Symposium on the Theory of Computing. ACM, 1983.
[FSS84] Merrick Furst, James B Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy. Math.
Syst. Theory, 1984.
[Hås86] J. Håstad. Almost optimal lower bounds for small depth circuits. In Proc. 18th Annual ACM Symposium
on the Theory of Computing. ACM, 1986.