Parallel Computing Unit 3 - Principles of Parallel Computing Design
Algorithm Design
• This problem can be solved in parallel: each molecular conformation can be determined independently,
and the calculation of the minimum-energy conformation is also parallelizable.
• Example of a problem with little to no parallelism: calculation of the Fibonacci series
(0, 1, 1, 2, 3, 5, 8, 13, 21, ...)
using the recurrence:
F(n) = F(n-1) + F(n-2)
• The calculation of F(n) uses the values of both F(n-1) and F(n-2), which must be computed first.
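To make the dependence concrete, here is a minimal sketch in Python (ours, not from the slides): each step consumes the results of the previous two, so the loop cannot be split into independent tasks.

```python
def fib(n):
    """Iterative Fibonacci: an inherently sequential recurrence."""
    a, b = 0, 1  # F(0), F(1)
    for _ in range(n):
        # F(i) depends on F(i-1) and F(i-2), so every step must wait
        # for the one before it: there is no independent work to hand
        # to other processors.
        a, b = b, a + b
    return a

print([fib(i) for i in range(9)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21]
```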
Identify Potential Areas to Parallelize
• Identify the program's hotspots:
• Know where most of the real work is being done. Most scientific and technical programs accomplish
the bulk of their work in a few places.
• Profilers and performance analysis tools can help here (see the sketch after this list)
• Focus on parallelizing the hotspots and ignore the sections of the program that account for little CPU usage.
• Identify bottlenecks in the program:
• Are there areas that are disproportionately slow, or that cause parallelizable work to halt or be deferred? For
example, I/O usually slows a program down.
• It may be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessarily
slow areas
• Identify inhibitors to parallelism:
• One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence above.
• Investigate other algorithms if possible:
• This may be the single most important consideration when designing a parallel application.
• Take advantage of optimized third-party parallel software and highly optimized math libraries available from
leading vendors (IBM's ESSL, Intel's MKL, AMD's ACML, etc.).
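As a concrete illustration of hotspot identification (a sketch, not part of the original slides), Python's built-in cProfile module reports where a program spends its time:

```python
import cProfile
import pstats

def hotspot(n):
    # Deliberately heavy loop: this is where the "real work" happens.
    total = 0
    for i in range(n):
        total += i * i
    return total

def main():
    for _ in range(50):
        hotspot(100_000)

profiler = cProfile.Profile()
profiler.runcall(main)
# Print the functions with the largest cumulative time: these are
# the hotspots worth parallelizing first.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```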
[Figure: task-dependency graph for dense matrix-vector multiplication, where node = task and edge = control dependence; the figure highlights the portion of the matrix and vector accessed by Task 1.]
Computation of each element of the output vector y is independent of the other elements. Based on this, a dense
matrix-vector product can be decomposed into n tasks.
Observations: While the tasks share data (namely, the vector b), they have no control
dependencies, i.e., no task needs to wait for the (partial) completion of any other. All tasks are
the same size in terms of number of operations. Is this the maximum number of tasks into which we
could decompose this problem?
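A minimal sketch of this row-wise decomposition (our illustration; the names are not from the slides): each task computes one element of y, reading only its own row of A and the shared vector b.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def row_task(A, b, i):
    # Task i: compute y[i] = sum_j A[i, j] * b[j].
    # Reads row i of A and all of b, writes only y[i],
    # so the n tasks are fully independent.
    return A[i, :] @ b

n = 8
rng = np.random.default_rng(0)
A, b = rng.random((n, n)), rng.random(n)

with ThreadPoolExecutor() as pool:
    y = np.array(list(pool.map(lambda i: row_task(A, b, i), range(n))))

assert np.allclose(y, A @ b)
```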
Example: Database Query Processing
• Consider the execution of the query:
MODEL = "CIVIC" AND YEAR = 2001 AND
(COLOR = "GREEN" OR COLOR = "WHITE")
on the following database:
ID#   Model    Year  Color  Dealer  Price
4523  Civic    2002  Blue   MN      $18,000
3476  Corolla  1999  White  IL      $15,000
7623  Camry    2001  Green  NY      $21,000
9834  Prius    2001  Green  CA      $18,000
6734  Civic    2001  White  OR      $17,000
5342  Altima   2001  Green  FL      $19,000
3845  Maxima   2001  Blue   NY      $22,000
8354  Accord   2000  Green  VT      $18,000
4395  Civic    2001  Red    CA      $17,000
7352  Civic    2002  Red    WA      $18,000
• The execution of the query can be divided into subtasks in various ways.
Example: Database Query Processing
• Task: compute set of elements that satisfy a predicate
• task result = table of entries that satisfy the predicate
• Edge: output of one task serves as input to the next
[Figure: task-dependency graph for the query. Four leaf tasks scan the table for MODEL = "CIVIC" (IDs 4523, 6734, 4395, 7352), YEAR = 2001 (IDs 7623, 9834, 6734, 5342, 3845, 4395), COLOR = "GREEN" (IDs 7623, 9834, 5342, 8354), and COLOR = "WHITE" (IDs 3476, 6734); later tasks combine these intermediate tables with the AND and OR operators.]
Different decompositions may yield different degrees of parallelism and different amounts of total work.
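To illustrate (a sketch of ours; the helper names are hypothetical): the same query can be evaluated with differently structured task graphs, and the intermediate tables, and hence the work and the available concurrency, differ between them.

```python
# Each "task" filters or combines sets of record IDs from the table above.
records = {
    4523: ("Civic", 2002, "Blue"),  3476: ("Corolla", 1999, "White"),
    7623: ("Camry", 2001, "Green"), 9834: ("Prius", 2001, "Green"),
    6734: ("Civic", 2001, "White"), 5342: ("Altima", 2001, "Green"),
    3845: ("Maxima", 2001, "Blue"), 8354: ("Accord", 2000, "Green"),
    4395: ("Civic", 2001, "Red"),   7352: ("Civic", 2002, "Red"),
}

def scan(pred):
    # Leaf task: scan the whole table for one predicate.
    return {rid for rid, row in records.items() if pred(row)}

# Four independent leaf tasks: these can run concurrently.
civic = scan(lambda r: r[0] == "Civic")
y2001 = scan(lambda r: r[1] == 2001)
green = scan(lambda r: r[2] == "Green")
white = scan(lambda r: r[2] == "White")

# Decomposition (a): (Civic AND 2001) AND (Green OR White)
result_a = (civic & y2001) & (green | white)
# Decomposition (b): Civic AND (2001 AND (Green OR White))
result_b = civic & (y2001 & (green | white))

assert result_a == result_b == {6734}  # same answer, different task graphs
```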
Granularity of Task Decompositions
• Granularity = task size: the amount of work associated with parallel tasks between synchronization/communication points
• Fine-grain = decompose into a large number of small tasks
• Coarse-grain = decompose into a small number of large tasks
• Granularity examples for dense matrix-vector multiply (see the sketch after this list):
• fine-grain: each task computes an individual element of y
• coarser-grain: each task computes 3 elements of y
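A minimal sketch of the two granularities (ours, not the slides'), where each callable stands in for one task:

```python
import numpy as np

n = 9
rng = np.random.default_rng(1)
A, b = rng.random((n, n)), rng.random(n)

# Fine-grain: one task per element of y (n tasks).
fine_tasks = [lambda i=i: A[i, :] @ b for i in range(n)]

# Coarser-grain: one task per block of 3 elements (n/3 tasks).
coarse_tasks = [lambda s=s: A[s:s + 3, :] @ b for s in range(0, n, 3)]

y_fine = np.array([t() for t in fine_tasks])
y_coarse = np.concatenate([t() for t in coarse_tasks])
assert np.allclose(y_fine, y_coarse) and np.allclose(y_fine, A @ b)
```

Fewer, larger tasks mean fewer synchronization points, but also less concurrency to exploit.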
Critical Path Length
[Figure: task-dependency graph for the first query decomposition, with task weights Task 1-4 = 10 each, Task 5 = 10, Task 6 = 6, Task 7 = 8.]
• Start nodes: Task 1, Task 2, Task 3, Task 4
• Finish node: Task 7
• Critical path: Task 1, Task 5, Task 7 (equivalently Task 2, Task 5, Task 7)
• Critical path length: 10 + 10 + 8 = 28
• Maximum degree of concurrency = 4 (Task 1, Task 2, Task 3, Task 4)
• Average degree of concurrency = total work / critical path length = (10+10+10+10+10+6+8)/28 = 64/28 ≈ 2.29
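These quantities can be computed mechanically from the task graph. A sketch (our code; the edge structure among Tasks 5-7 is inferred from the example) that reproduces the numbers above:

```python
def critical_path_length(weights, preds):
    """Longest weighted path through a task graph, given per-task
    weights and predecessor lists; assumes tasks are numbered in
    topological order."""
    finish = {}
    for t, w in weights.items():
        finish[t] = w + max((finish[p] for p in preds.get(t, ())), default=0)
    return max(finish.values())

# First decomposition: Tasks 1-4 scan; Task 5 combines Tasks 1 and 2,
# Task 6 combines Tasks 3 and 4, Task 7 combines Tasks 5 and 6.
weights = {1: 10, 2: 10, 3: 10, 4: 10, 5: 10, 6: 6, 7: 8}
preds = {5: [1, 2], 6: [3, 4], 7: [5, 6]}

cpl = critical_path_length(weights, preds)  # 10 + 10 + 8 = 28
avg = sum(weights.values()) / cpl           # 64 / 28 ≈ 2.29
print(cpl, round(avg, 2))                   # 28 2.29
```

Plugging in the weights and edges of the second decomposition below yields 36 and 1.83 in the same way.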
MODEL = "CIVIC" AND YEAR = 2001 AND
Critical Path Length (COLOR = "GREEN" OR COLOR = "WHITE")
Task 5 6
Task 6 12
Task 7 8
• Start node: Task 1, Task 2, Task 3, Task 4 • Average degree of concurrency = (10+10+10+10+6+12+8)/36 = 66/36 =
1.83
• Finish node: Task 7
• Maximum degree of concurrency = 4 (Task 1, Task 2, Task 3, Task 4)
• Critical path: Task 3, Task 5, Task 6, Task 7 and Task 3, Task 5, Task 6, Task 7
• Critical path length: 10+6+12+8 = 36
[Figure: task-dependency graph for a two-task decomposition of the dense matrix-vector product; Task 1 and Task 2 (weight 4 each) compute the two partial products, and Task 3 (weight 8) adds them:]

$$ y = \begin{bmatrix} A_{1,1}\,b_1 + A_{1,2}\,b_2 \\ A_{2,1}\,b_1 + A_{2,2}\,b_2 \end{bmatrix} $$

• Start nodes: Task 1, Task 2
• Finish node: Task 3
• Critical path: Task 1, Task 3 (equivalently Task 2, Task 3)
• Critical path length: 4 + 8 = 12
• Maximum degree of concurrency = 2 (Task 1, Task 2)
• Average degree of concurrency = (4+4+8)/12 = 16/12 ≈ 1.33
Limits on Parallel Performance
• It might appear that the parallel time can be made arbitrarily small by making the decomposition
ever finer in granularity.
• There is, however, an inherent bound on how fine the granularity of a computation can be.
• For example, in the case of multiplying a dense matrix with a vector, there can be no more
than n² concurrent tasks.
• Concurrent tasks may also have to exchange data with other tasks, which results in
communication overhead.
• The tradeoff between the granularity of a decomposition and the associated overheads often
determines the performance bounds.
• The fraction of application work that cannot be parallelized, as captured by Amdahl's law,
also limits parallel performance (see the sketch after this list).
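As a quick illustration of that last point (ours, not from the slides), Amdahl's law bounds the speedup on N processors at S = 1 / ((1 - p) + p/N), where p is the parallelizable fraction of the work:

```python
def amdahl_speedup(p, n):
    # p: parallelizable fraction of the work, n: number of processors.
    return 1.0 / ((1.0 - p) + p / n)

# Even with unlimited processors, a 10% serial fraction caps the
# speedup at 1 / 0.1 = 10x.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
# 2 1.82, 8 4.71, 64 8.77, 1024 9.91
```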
• Sum the local count vectors for the itemsets to produce the total count vector
• Other task decompositions are possible
[Figure: hybrid decomposition for finding the minimum of an array of size 16 using four tasks.]
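A sketch of that hybrid scheme (our illustration): four tasks first take local minima over four-element chunks (a data decomposition), and the partial results are then reduced (a task decomposition).

```python
from concurrent.futures import ThreadPoolExecutor

data = [8, 7, 11, 2, 9, 11, 12, 5, 14, 1, 4, 4, 7, 3, 6, 10]  # size 16

def local_min(chunk):
    # Stage 1 task: sequential minimum over one 4-element chunk.
    return min(chunk)

chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(local_min, chunks))  # four concurrent tasks

# Stage 2: reduce the four partial minima (this step could itself be
# a small tree of pairwise-min tasks).
print(min(partial))  # 1
```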
Processes and Mapping
• For this reason, a parallel algorithm must also provide a mapping of tasks to processes.
Note: These criteria often conflict with each other. For example, a decomposition into a single
task (i.e., no decomposition at all) minimizes interaction but yields no speedup at all! Can you
think of other such conflicting cases?
[Figure: mappings of the two query task graphs onto four processes. In the first decomposition, Task 5 (10) → P0, Task 6 (6) → P1, Task 7 (8) → P0; in the second, Task 5 (6), Task 6 (12), and Task 7 (8) all → P0.]
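A minimal sketch of such a mapping (our illustration, for the first decomposition): independent scan tasks are spread across processes, while each combining task is placed with one of its producers to reduce interaction.

```python
# Task weights from the first query decomposition.
tasks = {1: 10, 2: 10, 3: 10, 4: 10, 5: 10, 6: 6, 7: 8}

# Hypothetical mapping: the critical-path chain 1 -> 5 -> 7 stays on P0
# so those results never leave the process.
mapping = {1: "P0", 2: "P1", 3: "P2", 4: "P3",
           5: "P0",   # consumes Task 1's output locally
           6: "P2",   # consumes Task 3's output locally
           7: "P0"}   # consumes Task 5's output locally

# Per-process load: a rough balance check for this mapping.
load = {}
for t, p in mapping.items():
    load[p] = load.get(p, 0) + tasks[t]
print(load)  # {'P0': 28, 'P1': 10, 'P2': 16, 'P3': 10}
```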
[Figure: data sharing needed for matrix multiplication with (a) one-dimensional and (b) two-dimensional partitioning of the output matrix. Shaded portions of the input matrices A and B are required by the process that computes the shaded portion of the output matrix C.]
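To quantify the difference (a sketch under our own assumptions: n × n matrices, p processes, p a perfect square), a process owning a row block of C needs all of B, while a process owning a 2-D block needs only a block row of A and a block column of B:

```python
import math

def shared_data_1d(n, p):
    # 1-D partitioning: each process computes n/p rows of C and needs
    # n/p rows of A plus the entire matrix B.
    return (n // p) * n + n * n

def shared_data_2d(n, p):
    # 2-D partitioning: each process computes an (n/√p x n/√p) block
    # of C and needs n/√p rows of A and n/√p columns of B.
    q = math.isqrt(p)
    return 2 * (n // q) * n

n, p = 1024, 16
print(shared_data_1d(n, p))  # 1114112 elements
print(shared_data_2d(n, p))  # 524288 elements: 2-D shares far less
```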
• Graph Partitioning