Chapter 8 - Advanced Parallel Algorithms


Advanced Parallel Algorithms
8.1 Parallel Recursion

4
Divide and Conquer
• The principle of “divide and conquer” is as follows:
• Step 1: Divide the original problem into smaller sub-problems.
• Step 2: Solve the sub-problems recursively.
• Step 3: Combine the results of the sub-problems to obtain the result of the original problem.
• The sub-problems are independent of each other, so they can be solved in parallel.
• The key question is how to perform steps 1 and 3 most effectively.

5
Divide and Conquer

6
Complexity
• Consider a problem P of size n, divided into q = n/k sub-problems of k elements each (k > 1), executed in parallel on p processors:

t_divide_conquer(n,p) =
  t_trivial(n)                                                    if n is small enough
  t_serial(n)                                                     if p = 1
  t_divide(n,p) + t_combine(n,p) + (q/p)·t_divide_conquer(k,1)    if 1 < p < q
  t_divide(n,p) + t_combine(n,p) + t_divide_conquer(k,p/q)        if p ≥ q

7
For Example (1)
• Sum of n numbers A[1..n] with p processors.
• Idea:
  • If n = 1, return A[1].
  • If p = 1, run in serial mode.
  • Divide array A into two parts A1 and A2, each containing n/2 elements, and in parallel:
    • recursively compute S1, the sum of A1’s elements, with p/2 processors;
    • recursively compute S2, the sum of A2’s elements, with p/2 processors.
  • Return the total S = S1 + S2.

8
Sum of n numbers A[1..n] with p
processors
INPUT : A[1..n], p processors;
OUTPUT : SUM = ∑ A[i];
FUNCTION S = SUM(A,n,m,p) // n, m are the first and last indices
BEGIN
IF n = m THEN
RETURN A[n];
END IF;
IF p = 1 THEN
RETURN SEQUENCE_SUM(A,n,m);
END IF;
DO IN PARALLEL
S1 = SUM(A,n,(n+m)/2,p/2);
S2 = SUM(A,(n+m)/2+1,m,p/2);
END DO
RETURN S1 + S2;
END;
• Recurrence: T(n) = T(n/2) + O(1) (assuming p ≈ n).
• Complexity: O(log2n).
• Machine: PRAM EREW.
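As a minimal sketch (simulating the processor split rather than spawning real processes), the recursion above can be written in Python as follows; the p = 1 branch plays the role of SEQUENCE_SUM, and the returned step count illustrates the recurrence T(n) = T(n/2) + O(1):

```python
# Sketch of the recursive parallel SUM, with the processor split simulated.
# Returns (sum, steps): the two recursive calls run "in parallel", so the
# simulated step count is the max of the two halves plus one combine step.
def par_sum(A, lo, hi, p):
    if hi - lo == 1:                  # base case: a single element
        return A[lo], 1
    if p == 1:                        # SEQUENCE_SUM: serial fallback
        return sum(A[lo:hi]), hi - lo
    mid = (lo + hi) // 2
    s1, t1 = par_sum(A, lo, mid, p // 2)
    s2, t2 = par_sum(A, mid, hi, p - p // 2)
    return s1 + s2, max(t1, t2) + 1   # combine: S = S1 + S2

total, steps = par_sum(list(range(1, 9)), 0, 8, 8)
print(total, steps)                   # 36 in ~log2(n) parallel steps
```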
9
Example: convex hull
• The problem of determining the convex hull of a set of points in the plane.
• Input: n points (xk, yk) in the plane.
• Output: the set of points that form the smallest convex polygon containing all the remaining points.

10
Parallel QuickHull
• Idea:
  • Initially, define u and v as the two points with the smallest and largest x-coordinates; hence u and v are on the convex hull.
  • The segment (u,v) divides the initial set S into upper and lower regions: S_upper and S_lower.
  • Process S_upper and S_lower in parallel.

11
Parallel QuickHull
• Both the upper hull and the lower hull can be treated in the same way.
• Division step:
  • Select the pivot p as the point with the greatest distance from the line (p1, p2).
  • The pivot lies on the convex hull; points inside the triangle (p, p1, p2) are eliminated.
  • The remaining points are divided into two parts, outside the edges (p1, p) and (p, p2).
  • Apply the same steps recursively to these two parts.

12
Parallel QuickHull

• Tests used:
  • Pivot point p: maximizes |(p1 − p) × (p2 − p)|.
  • A point lies inside the triangle if the angles from that point to the three vertices sum to 2π.
  • Angle between two vectors: cos(a,b) = (a · b)/(|a||b|).
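A small Python sketch of the pivot test (function names are illustrative): the pivot maximizes the absolute cross product, which is twice the area of the triangle and hence proportional to the distance from the line:

```python
# 2D cross product of (a - o) and (b - o): twice the signed area of
# triangle (o, a, b); its sign tells which side of line o->a point b is on.
def cross(o, a, b):
    return (a[0]-o[0]) * (b[1]-o[1]) - (a[1]-o[1]) * (b[0]-o[0])

# Pivot selection: the point of S farthest from segment (p1, p2)
# maximizes |(p1 - p) x (p2 - p)|.
def pivot(S, p1, p2):
    return max(S, key=lambda p: abs(cross(p, p1, p2)))

pts = [(1, 1), (2, 5), (3, 2)]
print(pivot(pts, (0, 0), (4, 0)))   # (2, 5): farthest from the x-axis
```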

13
Illustration steps

Upper and Lower Hull

Set of points in S

14
Procedure QUICKHULL

17
Recursive in PRAM
• If we represent the sub-problem levels of a recursive algorithm as a tree, we obtain a k-ary tree (with k = 2, a binary tree).
• The idea of recursion on a PRAM is to divide the set of processors into groups, each corresponding to a sub-tree.
• All processors execute in parallel; within each group, the processors do the work of the corresponding sub-tree.

18
Recursion with UpperHull
• Variables used:
  • Each point i has a flag F[i]:
    • initially F[i] = 1;
    • F[i] = 0 when the point is eliminated (inside the hull);
    • F[i] = 2 when the point is marked as a hull vertex.
  • Each point keeps track of its current two anchor points through the variables P[i] and Q[i].
• Recursive steps:
  • All processors work in parallel to identify the pivots T[i].
  • Update the anchors P[i] and Q[i]:
    • points to the left of (P[i], T[i]) are assigned Q[i] = T[i];
    • points to the left of (T[i], Q[i]) are assigned P[i] = T[i].
  • Update the values F[i].
  • Repeat the above work until no F[i] = 1 remains.

19
8.2 Accelerated Cascading

20
Concepts of complexity
• In serial computation there is a single notion of complexity: the number of steps the algorithm executes (≈ running time), denoted S(n).
• In parallel computation there is the additional notion of the total number of operations performed over all processors, denoted W(n).
• If Wi(n) is the number of operations performed simultaneously at step i, we have the formula:

  W(n) = Σ (i = 1..S(n)) Wi(n)

21
Example of S(n) and W(n)
• The reduction problem over n = 2^k values using an operation ⊕. The balanced-tree algorithm is as follows:
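A serial Python sketch of the balanced-tree combine (the pairings at each level would run in parallel on the PRAM); it takes S(n) = log2 n steps while performing W(n) = n − 1 applications of ⊕:

```python
# Balanced-tree reduction: at each step, adjacent pairs are combined
# (in parallel on a PRAM); an odd element left over is carried up unchanged.
def tree_reduce(vals, op):
    steps, ops = 0, 0
    while len(vals) > 1:
        nxt = []
        for i in range(0, len(vals), 2):
            if i + 1 < len(vals):
                nxt.append(op(vals[i], vals[i + 1]))
                ops += 1
            else:
                nxt.append(vals[i])
        vals = nxt
        steps += 1
    return vals[0], steps, ops

result, S, W = tree_reduce(list(range(1, 9)), lambda a, b: a + b)
print(result, S, W)   # 36, S(8) = 3 steps, W(8) = 7 operations
```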

22
Examples of S(n) and W(n)
• Determine the values of S(n) and W(n) for the following code segments of the algorithm:

23
Accelerated Cascading Technique
• The cost of an algorithm is the total number of operations the system must perform.
• An algorithm is called optimal if W(n) = Θ(Ts(n)), where:
  • W(n): cost of the parallel algorithm;
  • Ts(n): running time of the best serial algorithm.
• The accelerated cascading technique combines a fast but non-optimal algorithm with an optimal but slower one.

24
Example (1)
• An array L[1..n] holds integer values in 1..k, with k = O(log2n). Determine the number of occurrences of each integer in L.
• Let R[i] be the number of occurrences of value i.
• The optimal serial algorithm runs in Ts(n) = Θ(n):

25
First parallel approach
• Use a two-dimensional array C[1..n, 1..k], with C[i, j] = 1 if L[i] = j and 0 otherwise.
• The number of occurrences of integer j then equals the sum of column C[1:n, j].
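A sequential Python sketch of this first approach (on the PRAM, both the matrix construction and the column sums would run in parallel; the helper name is illustrative):

```python
# First approach: C[i][j] = 1 iff L[i] == j (Theta(nk) work in Theta(1)
# parallel steps), then each column is summed (a binary-tree sum per column).
def naive_count(L, k):
    n = len(L)
    C = [[1 if L[i] == j else 0 for j in range(1, k + 1)] for i in range(n)]
    R = [sum(C[i][j] for i in range(n)) for j in range(k)]
    return R  # R[j - 1] = number of occurrences of value j

print(naive_count([1, 2, 1, 3, 2, 1, 4, 4], 4))   # [3, 2, 1, 2]
```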

26
Example with n = 8 & k = 4

27
Comments …
• The first two parallel loops perform Θ(nk) operations in Θ(1) steps.
• The third segment, following the binary-tree model, takes Θ(log2n) time.
• The actual number of operations in the third segment is Θ(nk).
• This algorithm is therefore not cost-effective ⇒ not optimal.

28
Using Accelerated Cascading
• With m = n/k, set up an array Ĉ[1..m, 1..k] corresponding to L; that is, divide L into m sub-arrays of k elements each.
• Use m processors, each running the optimal serial algorithm on its sub-array to count the occurrences of each j ∈ 1:k.
  • The number of steps executed is Θ(k) = Θ(log2n);
  • the cost of execution is Θ(mk) = Θ(n).
• R[j] is determined by summing each column Ĉ[1:m, j]:
  • using the balanced-tree algorithm: cost Θ(m);
  • total cost Θ(mk) = Θ(n) ⇒ cost optimal.
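The cascaded scheme can be sketched serially in Python (each block's serial count stands in for one processor; the final column sums stand in for the balanced-tree step; the function name is illustrative):

```python
# Accelerated cascading: split L into m = ceil(n/k) blocks of k elements,
# count serially inside each block (one simulated processor per block),
# then sum the m partial rows column-wise (balanced tree in the PRAM version).
def cascaded_count(L, k):
    rows = []
    for start in range(0, len(L), k):
        row = [0] * k
        for x in L[start:start + k]:
            row[x - 1] += 1          # values are in 1..k
        rows.append(row)
    return [sum(r[j] for r in rows) for j in range(k)]

print(cascaded_count([1, 2, 1, 3, 2, 1, 4, 4], 4))   # [3, 2, 1, 2]
```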
29
Optimal algorithm

35
Example (2): find Max
• Determine Xi = max{X1, X2, …, Xn}, i.e. Xi ≥ Xj ∀ j ∈ 1:n.
• Algorithm on PRAM EREW: O(log2n) steps at cost O(n), using O(n) processors and the balanced-tree model.
• Consider the following algorithm on a PRAM CRCW:

36
Find Max with PRAM CRCW

• The algorithm above has two parts:
• Part 1 can be done in parallel with n² processors in O(1) steps ⇒ cost O(n²).
• Part 2 can be done on a PRAM CRCW: determining each value M[i] needs O(1) steps at cost O(n) ⇒ total cost O(n).
• With a PRAM CRCW the problem is thus solved in O(1) steps with O(n²) processors, hence total cost O(n²).
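The two parts can be sketched serially in Python (the doubly nested loop stands in for the n² processors; all writes of 0 are concurrent writes of the same value, as the CRCW model requires):

```python
# Part 1: n^2 comparisons mark M[i] = 0 whenever X[i] loses to some X[j].
# Part 2: a surviving index with M[i] == 1 holds the maximum.
def crcw_max(X):
    n = len(X)
    M = [1] * n
    for i in range(n):
        for j in range(n):
            if X[i] < X[j]:
                M[i] = 0             # concurrent write of the common value 0
    return X[M.index(1)]

print(crcw_max([3, 7, 2, 5]))        # 7
```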

37
Find Max: Accelerated Cascading
• Let (W(n), T(n)) denote the operation count and running time of an algorithm.
• Find-Max problem:
  • (1) Balanced tree on EREW: (O(n), O(log2n)): optimal but slow.
  • (2) CRCW with n² processors: (O(n²), O(1)): fast but not optimal.
  • (3) A new algorithm built on the DLDT tree: (O(n·log2log2n), O(log2log2n)).
• Applying (1) and (3) with the accelerated cascading technique ⇒ new algorithm: (O(n), O(log2log2n)).

38
Tree DLDT
• DLDT = Doubly Logarithmic Depth Tree.
• This is a recursive tree.
• DLDT(n) is a tree with n leaves, where n = 2^(2^k).
• With k = 0, n = 2: the tree has one root and 2 leaves.
• With k > 0, the tree is recursively constructed as follows:
  • the root has √n = 2^(2^(k-1)) sub-trees;
  • each sub-tree has √n leaves: DLDT(√n).
• Comment: the number of sub-trees of a node equals the number of leaves in each of those sub-trees.

39
Tree DLDT
(figure: a DLDT tree with n = 16 leaves)

• The degree of a node u is the number of child nodes of u.
• Since n = 2^(2^k), the root has degree 2^(2^(k-1)) = √n.
• A node at level i has degree 2^(2^(k-i-1)), for 0 ≤ i < k.
• A node at level k-1 has 2 child nodes.
• A node at level k has 2 leaves.
40
Tree DLDT

• The depth of the tree is k + 1 = log2log2n + 1, where n is the number of leaves of the DLDT tree.
41
Find-Max: DLDT Tree
• Comments:
  • At level 0 (the root), n processors are used to determine the maximum of the results returned by the √n child nodes at level 1. The PRAM CRCW algorithm applies with (O(n), O(1)).
  • At level 1, the n processors are divided into m = √n groups, one per node; each node determines the maximum of its √m child nodes at level 2. The PRAM CRCW algorithm applies to each node with (O(m), O(1)).
  • At level i, each node corresponds to a group of 2^(2^(k-i)) processors and is the parent of 2^(2^(k-i-1)) child nodes. The PRAM CRCW algorithm applies to each node with (O(2^(2^(k-i))), O(1)).

42
Find-Max: DLDT Tree
• Algorithm’s idea:
  • Run k iteration steps, from level k−1 = log2log2n − 1 up to level 0 (the root).
  • The algorithm executes bottom-up.
  • At the i-th iteration step:
    • work in parallel with the n processors divided into 2^(2^k − 2^(k−i)) groups, each containing 2^(2^(k−i)) processors (the CRCW maximum over d = 2^(2^(k−i−1)) child nodes uses d² = 2^(2^(k−i)) processors);
    • using the CRCW algorithm in each node, the maximum of the child nodes is stored at the parent node.
43
Find-Max: DLDT tree
• Performance evaluation:
  • Time: O(k) = O(log2log2n) iteration steps.
  • Cost:
    • each node at the i-th iteration step performs O(2^(2^(k−i))) = O(n^(2^−i)) operations;
    • at the i-th step there are 2^(2^k − 2^(k−i)) = n^(1−2^−i) nodes ⇒ total cost of the i-th iteration step: Wi(n) = O(n^(2^−i) × n^(1−2^−i)) = O(n).
  • The cost of the entire algorithm is: W(n) = k·Wi(n) = O(n·k) = O(n·log2log2n).
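A serial Python sketch of the bottom-up sweep: at each iteration the surviving values are grouped (group size ≈ √m for m survivors) and each group's maximum is taken, so the number of iterations grows like log2log2 n:

```python
# Doubly-logarithmic max: repeatedly take the max within groups of about
# sqrt(m) values; each iteration is O(1) parallel time on a CRCW PRAM.
def dldt_max(X):
    vals = list(X)
    steps = 0
    while len(vals) > 1:
        g = max(2, round(len(vals) ** 0.5))      # ~sqrt-size groups
        vals = [max(vals[i:i + g]) for i in range(0, len(vals), g)]
        steps += 1
    return vals[0], steps

best, steps = dldt_max(list(range(16)))
print(best, steps)   # 15 found in 3 iterations (k + 1 for n = 16)
```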

44
Using Accelerated Cascading
• Step 1: Use the balanced-tree technique for ⌈log2log2log2n⌉ levels from the bottom.

45
Using Accelerated Cascading
• Step 2: Run the algorithm on the DLDT tree over the remaining m = n/log2log2n elements.

46
Using Accelerated Cascading
• Step 1. Using the balanced tree:
  • Each balanced-tree step halves the number of elements; this is done for log2log2log2n serial steps ⇒ T(n) = O(log2log2log2n).
  • With k = log2log2log2n, the number of remaining elements after this step is m = n/2^k = n/log2log2n.
  • Cost of step 1: W(n) = O(n).
• Step 2. Applying the DLDT algorithm to the m remaining elements:
  • T(n) = O(log2log2m) = O(log2log2n);
  • W(n) = O(m × log2log2m) = O(m × log2log2n) = O(n).
• Conclusion: the new algorithm is (O(n), O(log2log2n)).
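The two phases can be combined in a serial Python sketch (the round count is illustrative): pairwise halving for about log2log2log2 n rounds, then the doubly-logarithmic grouping on the survivors:

```python
import math

# Phase 1: balanced-tree halving for ceil(logloglog n) rounds, leaving about
# n / loglog n candidates at O(n) cost. Phase 2: doubly-logarithmic grouping.
def accelerated_max(X):
    vals = list(X)
    r = math.ceil(math.log2(max(2, math.log2(max(2, math.log2(len(vals)))))))
    for _ in range(r):                # phase 1: pair up and keep the larger
        vals = [max(vals[i:i + 2]) for i in range(0, len(vals), 2)]
    while len(vals) > 1:              # phase 2: sqrt-size groups, as in DLDT
        g = max(2, round(len(vals) ** 0.5))
        vals = [max(vals[i:i + g]) for i in range(0, len(vals), g)]
    return vals[0]

print(accelerated_max(list(range(256))))   # 255
```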

47
8.3 Pipeline

48
Pipeline Technique
• Widely used to speed up the execution of problems in which:
  • each main problem can be divided into several sub-problems;
  • these sub-problems may depend on each other in execution order;
  • at each moment, the processors execute in parallel one algorithm step on the sub-problems of different main problems (keeping the execution order fixed).

49
Pipeline Mechanism
• Consider n tasks t1, t2, …, tn to be done.
• Each ti can be divided into a sequence of sub-problems {ti,1, ti,2, …, ti,m} such that ti,k must be finished before ti,k+1 starts.
• Assume that at each step j = 1..m, algorithm step Aj is applied to the sub-problems t1,j, t2,j, …, tn,j.

50
Pipeline Mechanism

51
52
Example: matrix by vector

53
Example: multiply the matrix by the vector
• Illustration with a 4×4 matrix and a 4×1 vector.
• Step 1:
  • P0 receives (X0, A00);
  • computes the product of X0 and A00 and adds it to Y0;
  • pushes X0 down to P1 and moves X1 to the top of the array.
• Step 2:
  • P0 receives (X1, A01); P1 receives (X0, A10);
  • compute the corresponding products and add them to Y0 and Y1;
  • push X0 and X1 down, move X2 to the top of the array, … repeat until Step 8.
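The schedule above can be sketched serially in Python: at step t, every processor Pi with 0 ≤ t − i < n handles the pair (X[t−i], A[i][t−i]) and accumulates into Y[i], so the full product drains through the pipeline in 2n − 1 steps:

```python
# Pipelined matrix-vector product: at step t, every active processor i
# handles the pair (X[t-i], A[i][t-i]) "in parallel" and accumulates Y[i].
def pipeline_matvec(A, X):
    n = len(A)
    Y = [0] * n
    for t in range(2 * n - 1):           # 2n - 1 pipeline steps
        for i in range(n):               # all processors act simultaneously
            j = t - i
            if 0 <= j < n:
                Y[i] += A[i][j] * X[j]
    return Y

print(pipeline_matvec([[1, 2], [3, 4]], [5, 6]))   # [17, 39]
```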

54
Parallel Insertion Sort
• Algorithm idea:
  • The values to be sorted enter the array of processors one by one.
  • At each processor:
    • read the value just received;
    • compare it with the currently held value;
    • pass one of the two values on to the next processor.
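A serial Python sketch of the idea: each cell keeps the smaller of its held value and the incoming one and forwards the larger, so after all values have streamed through, the cells hold the sorted sequence:

```python
# Systolic insertion sort: values enter one by one; every cell compares the
# incoming value with its own, keeps the smaller, and passes the larger on.
def systolic_sort(stream):
    cells = []
    for v in stream:
        for i in range(len(cells)):
            if v < cells[i]:
                cells[i], v = v, cells[i]   # keep smaller, forward larger
        cells.append(v)                     # the last value opens a new cell
    return cells

print(systolic_sort([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]
```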

55
Thank you for your attention!

56
