
Lect 02


Parallel Programming

Overview
Historical Perspective: Moore's Law

Gordon Moore (in 1965):
• the complexity of semiconductor components had doubled each year since 1959
• exponential growth!

This became known as Moore's Law.

Intel Processor Performance per Cycle

[Figure: Intel processor performance per cycle over time (Olukotun 2005)]
Another New Path: Parallelism

• Lower development cost: combine processor cores
• Tolerate defects: disable any faulty processor

[Figure: a Chip Multiprocessor (CMP) with several processors (P) and their caches (C) on one chip]

Chip Multiprocessor (CMP): many advantages
Diverse Parallel Architectures

[Figure: processor/cache diagrams comparing Supercomputers (large-scale multiprocessors), Chip Multiprocessors (CMP), and Simultaneous Multithreading (SMT)]

improvements through full/partial hw replication!

Optimizing Thread Parallelism

[Figure: threads mapped onto Supercomputers, Desktops, Chip Multiprocessors (CMP), and Simultaneous Multithreading (SMT)]

multithreading in every scale of machine!

Chip Multi-Processors (CMP): Historical Perspective ~ 2005

IBM:
• Power 5 (2-core)

Intel:
• Montecito (2-core Itanium)
• Kentsfield (4-core P4)…

AMD:
• dual-core Opteron, Athlon X2
• Quad-core Opteron

Sun:
• UltraSparc T1: 8 cores, 32 threads
• UltraSparc T2: 8 cores, 64 threads

Sony, Toshiba, IBM:
• Cell: 9 cores

[Die photos: dual-core Power 5, Intel chip, Cell, dual-core Opteron]

abundant cores in Chip Multiprocessors – growing trend
Chip Multi-Processors (CMP): Historical Perspective, circa 2005 vs. Now

IBM:
• Power 5 (2-core)
• Power 9 … 40 cores

Intel:
• Montecito (2-core Itanium)
• Kentsfield (4-core P4)…
• Intel Xeon Ice Lake – 40 cores, 80 threads

AMD:
• dual-core Opteron, Athlon X2
• Quad-core Opteron
• Threadripper/Epyc: 64 cores, 128 threads

abundant cores in Chip Multiprocessors – growing trend

IBM Power 9 (today, 15+ years later)

Delivers up to 40 cores, either 4-way or 8-way SMT, 2 TB of RAM, and up to 24 SAS/SATA drives.
Prerequisites

• Programming in C or C++
• Data structures
• Basics of machine architecture
• Basics of network programming
Parallel vs. Distributed Programming

Parallel programming has matured:
• Few standard programming models
• Few common machine architectures
• Portability between models and architectures
Bottom Line

• Programmer can now focus on the program and use a suitable programming model
• Reasonable hope of portability
• Problem: much performance optimization is still platform-dependent
• Performance portability is a problem
Parallelism

• Ability to execute different parts of a program concurrently on different processors
• Goal: shorten execution time
Measures of Performance

• To computer scientists: speedup, execution time.
• To applications people: size of problem, accuracy of solution, etc.
Speedup of Algorithm

• Speedup of algorithm = sequential execution time / execution time on p processors (with the same data set).

That is, the ratio of the compute time for the sequential algorithm to the time for the parallel algorithm. If the speedup factor is n, then we say we have n-fold speedup. For example, if a sequential algorithm requires 10 min of compute time and a corresponding parallel algorithm requires 2 min, we say that there is 5-fold speedup.

[Figure: speedup plotted against the number of processors p]
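Written as a formula, with T_seq the sequential execution time and T_p the execution time on p processors (the symbols are just shorthand for the quantities defined above):

  Speedup(p) = T_seq / T_p

  e.g.  Speedup = 10 min / 2 min = 5   (5-fold speedup)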
Speedup on Problem

• Speedup on problem = sequential execution time of best known sequential algorithm / execution time on p processors.
• A more honest measure of performance.
• Avoids picking an easily parallelizable algorithm with poor sequential execution time.

Amdahl's Law

Amdahl's Law computes the overall speedup, taking into account that the sequential portion of the algorithm has no speedup, but the parallel portion of the algorithm has speedup S.

It may seem surprising that we obtain only 3-fold overall speedup when 90% of the algorithm achieves 4-fold speedup. This is a lesson of Amdahl's Law: the non-parallelizable portion of the algorithm has a disproportionate effect on the overall speedup.

A non-computational example may help explain this effect. Suppose that a team of four students is producing a report, together with an executive summary, where the main body of the report requires 8 hours to write, and the executive summary requires one hour to write and must have a single author (representing a sequential task). If only one person wrote the entire report, it would require 9 hours. But if the four students each write 1/4 of the body of the report (2 hours, in 4-fold parallelism), and then one student writes the summary, the elapsed time would be 3 hours, for a 3-fold overall speedup. The sequential portion of the task has a disproportionate effect because the other three students have nothing to do during that portion of the task.
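The report example, worked through with the numbers above:

  one author:    8 h (body) + 1 h (summary) = 9 h
  four authors:  8 h / 4 (body in parallel) + 1 h (summary) = 2 h + 1 h = 3 h
  speedup:       9 h / 3 h = 3   (3-fold, even though the body itself sped up 4-fold)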
What Speedups Can You Get?

• Linear speedup
  • Confusing term: implicitly means a 1-to-1 speedup per processor.
  • (almost always) as good as you can do.
• Sub-linear speedup: more normal due to overhead of startup, synchronization, communication, etc.

Overhead is the time and resources spent on transferring data and messages between the parallel components, such as processors, cores, nodes, or clusters.
Speedup

[Figure: speedup vs. number of processors p, comparing the ideal linear curve with the actual (sub-linear) curve]
Scalability

• No really precise definition.
• Roughly speaking, a program is said to scale to a certain number of processors p, if going from p-1 to p processors results in some acceptable improvement in speedup (for instance, an increase of 0.5).
Super-linear Speedup?

• Due to cache/memory effects:
  • Subparts fit into cache/memory of each node.
  • Whole problem does not fit in cache/memory of a single node.
• Nondeterminism in search problems:
  • One thread finds a near-optimal solution very quickly => leads to drastic pruning of the search space.
Cardinal Performance Rule

• Don't leave (too) much of your code sequential!

In parallel computing, the goal is to execute multiple tasks simultaneously, taking advantage of the processing power of multiple CPU cores or computing resources. This can significantly speed up the execution of programs, especially those that involve computationally intensive tasks or large datasets.

The Cardinal Performance Rule emphasizes the importance of parallelizing your code effectively to maximize performance. Leaving too much of your code in a sequential form means that it is not taking full advantage of parallel processing capabilities. Instead, tasks are executed one after another, potentially leading to inefficiencies and longer execution times.

By parallelizing code effectively, developers can harness the full power of modern computing architectures and achieve better performance and scalability for their applications.
Amdahl's Law

• If 1/s of the program is sequential, then you can never get a speedup better than s.
• (Normalized) sequential execution time = 1/s + (1 - 1/s) = 1
• Best parallel execution time on p processors = 1/s + (1 - 1/s)/p
• When p goes to infinity, parallel execution time = 1/s
• Speedup = s.

Equivalently, with P the proportion of the task that can be parallelized and N the number of processing units (CPU cores, nodes, etc.) available for parallel execution:

  Speedup = 1 / ((1 - P) + P/N)

Speedup is the ratio of the execution time of the original (serial) task to the execution time of the parallelized task.
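A minimal C sketch that simply evaluates the formula above (the function name amdahl_speedup and the sample values of P and N are illustrative, not from the slides):

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - P) + P / N)
   P = fraction of the task that can be parallelized
   N = number of processing units */
static double amdahl_speedup(double P, int N) {
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void) {
    printf("%.2f\n", amdahl_speedup(0.90, 4));       /* ~3.08: 90% parallel on 4 units */
    printf("%.2f\n", amdahl_speedup(0.90, 1000000)); /* ~10.00: approaches 1 / (1 - P) */
    return 0;
}

The second call illustrates the limit: as N grows, the speedup approaches 1 / (1 - P), which is the s in the bullet points above.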
Why keep something sequential?

• Some parts of the program are not parallelizable (because of dependences)
• Some parts may be parallelizable, but the overhead dwarfs the increased speedup.
When can two statements execute in parallel?

• On one processor:
    statement 1;
    statement 2;

• On two processors:
    processor1:        processor2:
    statement1;        statement2;
Fundamental Assumption

• Processors execute independently: no control over order of execution between processors
When can 2 statements execute in parallel?

• Possibility 1
    Processor1:        Processor2:
    statement1;
                       statement2;

• Possibility 2
    Processor1:        Processor2:
                       statement2;
    statement1;
When can 2 statements execute in parallel?

• Their order of execution must not matter!
• In other words,
    statement1; statement2;
  must be equivalent to
    statement2; statement1;
Example 1

a = 1;
b = a;

• Statements cannot be executed in parallel.
• Program modifications may make it possible (see the sketch below).
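One such modification, as a sketch: because the value assigned to a is a known constant, it can be substituted into the second statement (compilers call this constant or copy propagation).

    processor1:        processor2:
    a = 1;             b = 1;

After the substitution the two statements no longer share any variable, so their order no longer matters and they can run on different processors.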
Example 2

a = f(x);
b = a;

• May not be wise to change the program (sequential execution would take longer).
Example 3

a = 1;
a = 2;

• Statements cannot be executed in parallel.
True dependence

Statements S1, S2.

S2 has a true dependence on S1
iff
S2 reads a value written by S1.

for (j = 1; j < n; j++)
  S1: a[j] = a[j-1];

A true dependence occurs when a location in memory is written to before it is read. It is also known as a flow dependence or data dependence, and occurs when an instruction depends on the result of a previous instruction. A violation of a true dependence leads to a read-after-write (RAW) hazard.
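For comparison with the straight-line examples on the next two slides, the same kind of illustration for a true dependence (this example is added here, not from the original slide):

  1. A = 3
  2. B = A + 1
  3. C = B

Instruction 2 reads the value written by instruction 1, and instruction 3 reads the value written by instruction 2, so neither pair can be reordered or run in parallel without risking a RAW hazard.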
Anti-dependence

Statements S1, S2.

S2 has an anti-dependence on S1
iff
S2 writes a value read by S1.

for (j = 0; j < n; j++)
  S1: b[j] = b[j+1];

An anti-dependence occurs when a location in memory is read before that same location is written to. It occurs when an instruction requires a value that is later updated. A violation of an anti-dependence leads to a write-after-read (WAR) hazard.

  1. B = 3
  2. A = B + 1
  3. B = 7

In this example, instruction 2 anti-depends on instruction 3: B is first read (in instruction 2) and then written (in instruction 3), so the ordering of these instructions cannot be changed. In the loop above, b[j+1] is likewise read in iteration j before it is written in iteration j+1.
Output Dependence

Statements S1, S2.

S2 has an output dependence on S1
iff
S2 writes a variable written by S1.

for (j = 0; j < n; j++) {
  S1: c[j] = j;
  S2: c[j+1] = 5;
}

An output dependence occurs when a location in memory is written to before that same location is written to again in another statement. It occurs when the ordering of instructions will affect the final output value of a variable. A violation of an output dependence leads to a write-after-write (WAW) hazard.

  1. B = 3
  2. A = B + 1
  3. B = 7

In this example, there is an output dependence between instructions 3 and 1: changing the ordering of these instructions will change the final value of A. As with anti-dependences, output dependences are name dependences; that is, they may be removed through renaming of variables.
When can 2 statements execute in parallel?

S1 and S2 can execute in parallel
iff
there are no dependences between S1 and S2:
• true dependences
• anti-dependences
• output dependences

Some dependences can be removed, as sketched below.
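A minimal sketch of removing a name dependence by renaming (the variables x, x2, y are illustrative):

  /* before: the second statement anti-depends on the first through x */
  y = x + 1;     /* reads x  */
  x = 42;        /* writes x */

  /* after renaming: the write goes to a fresh variable x2 */
  y = x + 1;     /* reads x   */
  x2 = 42;       /* writes x2 */
  /* later uses of the new value refer to x2 */

Renaming works only for anti- and output dependences (name dependences); true dependences cannot be removed this way.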
Example 4

• Most parallelism occurs in loops.

for (i = 0; i < 100; i++)
  a[i] = i;

• No dependences.
• Iterations can be executed in parallel (a sketch follows).
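As an illustration of how such independent iterations could actually be run in parallel, a minimal sketch using OpenMP (OpenMP is an assumption here; the slides do not commit to any particular programming model):

#include <omp.h>

int a[100];

void fill(void)
{
    /* no dependences between iterations, so they may execute
       in any order and on any number of threads */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        a[i] = i;
}

Compile with OpenMP enabled (e.g. gcc -fopenmp); without it, the pragma is ignored and the loop simply runs sequentially.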
Example 5

for (i = 0; i < 100; i++) {
  a[i] = i;
  b[i] = 2*i;
}

• Iterations and statements can be executed in parallel.


Example 6

for (i = 0; i < 100; i++) a[i] = i;
for (i = 0; i < 100; i++) b[i] = 2*i;

• Iterations and loops can be executed in parallel.


Example 7

for (i = 0; i < 100; i++)
  a[i] = a[i] + 100;

• There is a dependence … on itself! (each iteration reads and writes only its own a[i])
• Loop is still parallelizable.
Example 8

for (i = 0; i < 100; i++)
  a[i] = f(a[i-1]);

• Dependence between a[i] and a[i-1].
• Loop iterations are not parallelizable.
Loop-carried dependence

• A loop-carried dependence is a dependence that is present only if the statements are part of the execution of a loop.
• Otherwise, we call it a loop-independent dependence.
• Loop-carried dependences prevent loop iteration parallelization.

When a statement in one iteration of a loop depends in some way on a statement in a different iteration of the same loop, a loop-carried dependence exists. However, if a statement in one iteration of a loop depends only on a statement in the same iteration of the loop, this creates a loop-independent dependence. In this example, code block 1 shows a loop-carried dependence between statement S2 in iteration i and statement S1 in iteration i-1. This is to say that statement S2 cannot proceed until statement S1 in the previous iteration finishes. Code block 2 shows a loop-independent dependence between statements S1 and S2 in the same iteration.

// Code block 1
for (i = 1; i <= 4; i++) {
  S1: b[i] = 8;
  S2: a[i] = b[i-1] + 10;
}

// Code block 2
for (i = 0; i < 4; i++) {
  S1: b[i] = 8;
  S2: a[i] = b[i] + 10;
}
Example 9

for (i = 0; i < 100; i++)
  for (j = 0; j < 100; j++)
    a[i][j] = f(a[i][j-1]);

• Loop-independent dependence on i.
• Loop-carried dependence on j.
• Outer loop can be parallelized, inner loop cannot.
Example 10

for (j = 0; j < 100; j++)
  for (i = 0; i < 100; i++)
    a[i][j] = f(a[i][j-1]);

(Note: the loop whose index appears shifted, here j-1, carries the dependence and cannot be parallelized.)

• Inner loop can be parallelized, outer loop cannot.
• Less desirable situation.
• Loop interchange is sometimes possible (a sketch follows).
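The interchanged version of this loop nest, as a sketch (legal here because the only dependence is along j, so swapping the loops does not reverse it):

  for (i = 0; i < 100; i++)        /* now the outer loop: iterations independent */
    for (j = 0; j < 100; j++)      /* carries the a[i][j-1] dependence */
      a[i][j] = f(a[i][j-1]);

This is exactly the shape of Example 9: the parallelizable loop is now the outer one, which is the more desirable situation.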
Level of loop-carried dependence

• Is the nesting depth of the loop that carries the dependence.
• Indicates which loops can be parallelized.
Be careful … Example 11

printf("a");
printf("b");

• Statements have a hidden output dependence due to the output stream (the output could appear as ab or ba depending on order).

When you have consecutive printf() statements like printf("a"); followed by printf("b");, the output of both statements will ultimately be directed to the same output stream. Even though the two statements do not share any data dependences in terms of program variables or memory locations, they do share a hidden output dependence due to their interaction with the output stream.

This hidden output dependence means that the order in which the printf() statements are executed matters. If the output stream were to interleave the characters from each printf() call in a different order, the overall output would change. For example, if the output stream were to buffer the characters and then flush them to the console, the characters from the second printf() might appear before those from the first one.

This highlights the importance of considering not only data dependences but also output dependences when analyzing and optimizing code for parallel execution or other forms of concurrency.
Be careful … Example 12

a = f(x);
b = g(x);

• Statements could have a hidden dependence if f and g update the same variable.
• Also depends on what f and g can do to x.
Be careful … Example 13

for (i = 0; i < 100; i++)
  a[i+10] = f(a[i]);

• Dependence between a[10], a[20], …
• Dependence between a[11], a[21], …
• …
• Some parallel execution is possible (see the sketch below).
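One way to realize the "some parallel execution", as a sketch: the dependence distance is 10, so the loop can be strip-mined into blocks of 10 iterations; within a block the reads and writes touch disjoint elements, while the blocks themselves must stay in order.

  for (block = 0; block < 100; block += 10)
    /* the 10 iterations of this inner loop are independent of each other */
    for (i = block; i < block + 10; i++)
      a[i+10] = f(a[i]);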
Be careful … Example 14

for (i = 1; i < 100; i++) {
  a[i] = …;
  ... = a[i-1];
}

• Dependence between a[i] and a[i-1].
• Complete parallel execution impossible.
• Pipelined parallel execution possible.
Be careful … Example 15

for (i = 0; i < 100; i++)
  a[i] = f(a[indexa[i]]);

• Cannot tell for sure.
• Parallelization depends on user knowledge of values in indexa[].
• User can tell, compiler cannot.
Optimizations: Example 16

for (i = 0; i < 100000; i++)
  a[i + 1000] = a[i] + 1;

Iteration i writes a[i + 1000], which is read again 1000 iterations later, so iteration i + 1000 depends on the result of iteration i. Simply running all iterations concurrently could therefore cause race conditions and incorrect results; the order of these dependent iterations must be preserved.

• Cannot be parallelized as is.
• May be parallelized by applying certain code transformations (a sketch follows).
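One such transformation, as a sketch (the same strip-mining idea as in Example 13): the dependence distance is 1000, so iterations can be grouped into blocks of 1000; iterations within a block are mutually independent and could run in parallel, while the blocks execute in order.

  for (block = 0; block < 100000; block += 1000)
    /* this inner loop has no dependences within the block */
    for (i = block; i < block + 1000; i++)
      a[i + 1000] = a[i] + 1;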
An aside

• Parallelizing compilers analyze program dependences to decide parallelization.
• In parallelization by hand, the user does the same analysis.
• Compiler: more convenient and more correct.
• User: more powerful, can analyze more patterns.
To remember

• Statement order must not matter.
• Statements must not have dependences.
• Some dependences can be removed.
• Some dependences may not be obvious.
