Parallel Computing Unit 3 - Principles of Parallel Computing Design
Algorithm Design
• This problem can be solved in parallel: each molecular conformation can be determined independently,
and the calculation of the minimum-energy conformation is also parallelizable.
• Example of a problem with little to no parallelism: calculation of the Fibonacci series
(0, 1, 1, 2, 3, 5, 8, 13, 21, ...)
using the recurrence:
F(n) = F(n-1) + F(n-2)
• The calculation of F(n) uses the values of both F(n-1) and F(n-2), which must be computed first.
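To make the dependence concrete, here is a minimal sketch in Python (ours, not from the slides): each step consumes the results of the previous two, so the loop cannot be split into independent tasks.

```python
def fib(n):
    """Iterative Fibonacci: an inherently sequential recurrence."""
    a, b = 0, 1  # F(0), F(1)
    for _ in range(n):
        # F(i) depends on F(i-1) and F(i-2), so every step must wait
        # for the one before it: there is no independent work to hand
        # to other processors.
        a, b = b, a + b
    return a

print([fib(i) for i in range(9)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21]
```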
Identify Potential Areas to Parallelize
• Identify the program's hotspots:
• Know where most of the real work is being done. Most scientific and technical programs accomplish
the bulk of their work in a few places.
• Profilers and performance analysis tools can help here (see the sketch after this list)
• Focus on parallelizing the hotspots and ignore the sections of the program that account for little CPU usage.
• Identify bottlenecks in the program:
• Are there areas that are disproportionately slow, or that cause parallelizable work to halt or be deferred? For
example, I/O usually slows a program down.
• It may be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessarily
slow areas
• Identify inhibitors to parallelism:
• One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence above.
• Investigate other algorithms if possible:
• This may be the single most important consideration when designing a parallel application.
• Take advantage of optimized third-party parallel software and highly optimized math libraries available from
leading vendors (IBM's ESSL, Intel's MKL, AMD's ACML, etc.).
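As a concrete illustration of hotspot identification (a sketch, not part of the original slides), Python's built-in cProfile module reports where a program spends its time:

```python
import cProfile
import pstats

def hotspot(n):
    # Deliberately heavy loop: this is where the "real work" happens.
    total = 0
    for i in range(n):
        total += i * i
    return total

def main():
    for _ in range(50):
        hotspot(100_000)

profiler = cProfile.Profile()
profiler.runcall(main)
# Print the functions with the largest cumulative time: these are
# the hotspots worth parallelizing first.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```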
[Figure: task-dependency graph for dense matrix-vector multiplication, where node = task and edge = control dependence; the figure highlights the portion of the matrix and vector accessed by Task 1.]
Computation of each element of the output vector y is independent of the other elements. Based on this, a dense
matrix-vector product can be decomposed into n tasks.
Observations: While the tasks share data (namely, the vector b), they have no control
dependencies, i.e., no task needs to wait for the (partial) completion of any other. All tasks are
the same size in terms of number of operations. Is this the maximum number of tasks into which we
could decompose this problem?
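A minimal sketch of this row-wise decomposition (our illustration; the names are not from the slides): each task computes one element of y, reading only its own row of A and the shared vector b.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def row_task(A, b, i):
    # Task i: compute y[i] = sum_j A[i, j] * b[j].
    # Reads row i of A and all of b, writes only y[i],
    # so the n tasks are fully independent.
    return A[i, :] @ b

n = 8
rng = np.random.default_rng(0)
A, b = rng.random((n, n)), rng.random(n)

with ThreadPoolExecutor() as pool:
    y = np.array(list(pool.map(lambda i: row_task(A, b, i), range(n))))

assert np.allclose(y, A @ b)
```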
Example: Database Query Processing
• Consider the execution of the query:
MODEL = "CIVIC" AND YEAR = 2001 AND
(COLOR = "GREEN" OR COLOR = "WHITE")
on the following database:
ID#   Model    Year  Color  Dealer  Price
4523  Civic    2002  Blue   MN      $18,000
3476  Corolla  1999  White  IL      $15,000
7623  Camry    2001  Green  NY      $21,000
9834  Prius    2001  Green  CA      $18,000
6734  Civic    2001  White  OR      $17,000
5342  Altima   2001  Green  FL      $19,000
3845  Maxima   2001  Blue   NY      $22,000
8354  Accord   2000  Green  VT      $18,000
4395  Civic    2001  Red    CA      $17,000
7352  Civic    2002  Red    WA      $18,000
• The execution of the query can be divided into subtasks in various ways.
Example: Database Query Processing
• Task: compute set of elements that satisfy a predicate
• task result = table of entries that satisfy the predicate
• Edge: output of one task serves as input to the next
[Figure: task-dependency graph for the query. Four leaf tasks scan the table for MODEL = "CIVIC" (IDs 4523, 6734, 4395, 7352), YEAR = 2001 (IDs 7623, 9834, 6734, 5342, 3845, 4395), COLOR = "GREEN" (IDs 7623, 9834, 5342, 8354), and COLOR = "WHITE" (IDs 3476, 6734); later tasks combine these intermediate tables with the AND and OR operators.]
Different decompositions may yield different degrees of parallelism and different amounts of total work.
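To illustrate (a sketch of ours; the helper names are hypothetical): the same query can be evaluated with differently structured task graphs, and the intermediate tables, and hence the work and the available concurrency, differ between them.

```python
# Each "task" filters or combines sets of record IDs from the table above.
records = {
    4523: ("Civic", 2002, "Blue"),  3476: ("Corolla", 1999, "White"),
    7623: ("Camry", 2001, "Green"), 9834: ("Prius", 2001, "Green"),
    6734: ("Civic", 2001, "White"), 5342: ("Altima", 2001, "Green"),
    3845: ("Maxima", 2001, "Blue"), 8354: ("Accord", 2000, "Green"),
    4395: ("Civic", 2001, "Red"),   7352: ("Civic", 2002, "Red"),
}

def scan(pred):
    # Leaf task: scan the whole table for one predicate.
    return {rid for rid, row in records.items() if pred(row)}

# Four independent leaf tasks: these can run concurrently.
civic = scan(lambda r: r[0] == "Civic")
y2001 = scan(lambda r: r[1] == 2001)
green = scan(lambda r: r[2] == "Green")
white = scan(lambda r: r[2] == "White")

# Decomposition (a): (Civic AND 2001) AND (Green OR White)
result_a = (civic & y2001) & (green | white)
# Decomposition (b): Civic AND (2001 AND (Green OR White))
result_b = civic & (y2001 & (green | white))

assert result_a == result_b == {6734}  # same answer, different task graphs
```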
Granularity of Task Decompositions
• Granularity = task size: the amount of work associated with parallel tasks between synchronization/communication points
• Fine-grain = decompose into a large number of small tasks
• Coarse-grain = decompose into a small number of large tasks
• Granularity examples for dense matrix-vector multiply (see the sketch after this list):
• fine-grain: each task computes an individual element of y
• coarser-grain: each task computes 3 elements of y
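A minimal sketch of the two granularities (ours, not the slides'), where each callable stands in for one task:

```python
import numpy as np

n = 9
rng = np.random.default_rng(1)
A, b = rng.random((n, n)), rng.random(n)

# Fine-grain: one task per element of y (n tasks).
fine_tasks = [lambda i=i: A[i, :] @ b for i in range(n)]

# Coarser-grain: one task per block of 3 elements (n/3 tasks).
coarse_tasks = [lambda s=s: A[s:s + 3, :] @ b for s in range(0, n, 3)]

y_fine = np.array([t() for t in fine_tasks])
y_coarse = np.concatenate([t() for t in coarse_tasks])
assert np.allclose(y_fine, y_coarse) and np.allclose(y_fine, A @ b)
```

Fewer, larger tasks mean fewer synchronization points, but also less concurrency to exploit.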
Critical Path Length
[Figure: task-dependency graph for the first query decomposition, with task weights Task 1-4 = 10 each, Task 5 = 10, Task 6 = 6, Task 7 = 8.]
• Start nodes: Task 1, Task 2, Task 3, Task 4
• Finish node: Task 7
• Critical path: Task 1, Task 5, Task 7 (equivalently Task 2, Task 5, Task 7)
• Critical path length: 10 + 10 + 8 = 28
• Maximum degree of concurrency = 4 (Task 1, Task 2, Task 3, Task 4)
• Average degree of concurrency = total work / critical path length = (10+10+10+10+10+6+8)/28 = 64/28 ≈ 2.29
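These quantities can be computed mechanically from the task graph. A sketch (our code; the edge structure among Tasks 5-7 is inferred from the example) that reproduces the numbers above:

```python
def critical_path_length(weights, preds):
    """Longest weighted path through a task graph, given per-task
    weights and predecessor lists; assumes tasks are numbered in
    topological order."""
    finish = {}
    for t, w in weights.items():
        finish[t] = w + max((finish[p] for p in preds.get(t, ())), default=0)
    return max(finish.values())

# First decomposition: Tasks 1-4 scan; Task 5 combines Tasks 1 and 2,
# Task 6 combines Tasks 3 and 4, Task 7 combines Tasks 5 and 6.
weights = {1: 10, 2: 10, 3: 10, 4: 10, 5: 10, 6: 6, 7: 8}
preds = {5: [1, 2], 6: [3, 4], 7: [5, 6]}

cpl = critical_path_length(weights, preds)  # 10 + 10 + 8 = 28
avg = sum(weights.values()) / cpl           # 64 / 28 ≈ 2.29
print(cpl, round(avg, 2))                   # 28 2.29
```

Plugging in the weights and edges of the second decomposition below yields 36 and 1.83 in the same way.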
MODEL = "CIVIC" AND YEAR = 2001 AND
Critical Path Length (COLOR = "GREEN" OR COLOR = "WHITE")
Task 5 6
Task 6 12
Task 7 8
• Start node: Task 1, Task 2, Task 3, Task 4 • Average degree of concurrency = (10+10+10+10+6+12+8)/36 = 66/36 =
1.83
• Finish node: Task 7
• Maximum degree of concurrency = 4 (Task 1, Task 2, Task 3, Task 4)
• Critical path: Task 3, Task 5, Task 6, Task 7 and Task 3, Task 5, Task 6, Task 7
• Critical path length: 10+6+12+8 = 36
[Figure: task-dependency graph for a two-task decomposition of the dense matrix-vector product; Task 1 and Task 2 (weight 4 each) compute the two partial products, and Task 3 (weight 8) adds them:]

$$ y = \begin{bmatrix} A_{1,1}\,b_1 + A_{1,2}\,b_2 \\ A_{2,1}\,b_1 + A_{2,2}\,b_2 \end{bmatrix} $$

• Start nodes: Task 1, Task 2
• Finish node: Task 3
• Critical path: Task 1, Task 3 (equivalently Task 2, Task 3)
• Critical path length: 4 + 8 = 12
• Maximum degree of concurrency = 2 (Task 1, Task 2)
• Average degree of concurrency = (4+4+8)/12 = 16/12 ≈ 1.33
Limits on Parallel Performance
• It might appear that the parallel time can be made arbitrarily small by making the decomposition
ever finer in granularity.
• There is, however, an inherent bound on how fine the granularity of a computation can be.
• For example, in the case of multiplying a dense matrix with a vector, there can be no more
than n² concurrent tasks.
• Concurrent tasks may also have to exchange data with other tasks, which results in
communication overhead.
• The tradeoff between the granularity of a decomposition and the associated overheads often
determines the performance bounds.
• The fraction of application work that cannot be parallelized, as captured by Amdahl's law,
also limits parallel performance (see the sketch after this list).
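As a quick illustration of that last point (ours, not from the slides), Amdahl's law bounds the speedup on N processors at S = 1 / ((1 - p) + p/N), where p is the parallelizable fraction of the work:

```python
def amdahl_speedup(p, n):
    # p: parallelizable fraction of the work, n: number of processors.
    return 1.0 / ((1.0 - p) + p / n)

# Even with unlimited processors, a 10% serial fraction caps the
# speedup at 1 / 0.1 = 10x.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
# 2 1.82, 8 4.71, 64 8.77, 1024 9.91
```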
• Sum the local count vectors for the itemsets to produce the total count vector
• Other task decompositions are possible
[Figure: hybrid decomposition for finding the minimum of an array of size 16 using four tasks.]
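A sketch of that hybrid scheme (our illustration): four tasks first take local minima over four-element chunks (a data decomposition), and the partial results are then reduced (a task decomposition).

```python
from concurrent.futures import ThreadPoolExecutor

data = [8, 7, 11, 2, 9, 11, 12, 5, 14, 1, 4, 4, 7, 3, 6, 10]  # size 16

def local_min(chunk):
    # Stage 1 task: sequential minimum over one 4-element chunk.
    return min(chunk)

chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(local_min, chunks))  # four concurrent tasks

# Stage 2: reduce the four partial minima (this step could itself be
# a small tree of pairwise-min tasks).
print(min(partial))  # 1
```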
Processes and Mapping
• For this reason, a parallel algorithm must also provide a mapping of tasks to processes.
Note: These criteria often conflict with each other. For example, a decomposition into a single
task (i.e., no decomposition at all) minimizes interaction but yields no speedup at all! Can you
think of other such conflicting cases?
[Figure: mappings of the two query task graphs onto four processes. In the first decomposition, Task 5 (10) → P0, Task 6 (6) → P1, Task 7 (8) → P0; in the second, Task 5 (6), Task 6 (12), and Task 7 (8) all → P0.]
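A minimal sketch of such a mapping (our illustration, for the first decomposition): independent scan tasks are spread across processes, while each combining task is placed with one of its producers to reduce interaction.

```python
# Task weights from the first query decomposition.
tasks = {1: 10, 2: 10, 3: 10, 4: 10, 5: 10, 6: 6, 7: 8}

# Hypothetical mapping: the critical-path chain 1 -> 5 -> 7 stays on P0
# so those results never leave the process.
mapping = {1: "P0", 2: "P1", 3: "P2", 4: "P3",
           5: "P0",   # consumes Task 1's output locally
           6: "P2",   # consumes Task 3's output locally
           7: "P0"}   # consumes Task 5's output locally

# Per-process load: a rough balance check for this mapping.
load = {}
for t, p in mapping.items():
    load[p] = load.get(p, 0) + tasks[t]
print(load)  # {'P0': 28, 'P1': 10, 'P2': 16, 'P3': 10}
```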
[Figure: data sharing needed for matrix multiplication with (a) one-dimensional and (b) two-dimensional partitioning of the output matrix. Shaded portions of the input matrices A and B are required by the process that computes the shaded portion of the output matrix C.]
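To quantify the difference (a sketch under our own assumptions: n × n matrices, p processes, p a perfect square), a process owning a row block of C needs all of B, while a process owning a 2-D block needs only a block row of A and a block column of B:

```python
import math

def shared_data_1d(n, p):
    # 1-D partitioning: each process computes n/p rows of C and needs
    # n/p rows of A plus the entire matrix B.
    return (n // p) * n + n * n

def shared_data_2d(n, p):
    # 2-D partitioning: each process computes an (n/√p x n/√p) block
    # of C and needs n/√p rows of A and n/√p columns of B.
    q = math.isqrt(p)
    return 2 * (n // q) * n

n, p = 1024, 16
print(shared_data_1d(n, p))  # 1114112 elements
print(shared_data_2d(n, p))  # 524288 elements: 2-D shares far less
```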
• Graph Partitioning