Lect 02
Overview
Historical Perspective:
Moore’s Law
Gordon Moore (in 1965):
the complexity of semiconductor components had doubled
each year since 1959
exponential growth!
(Olukotun 2005)
Another New Path: Parallelism
many advantages
Diverse Parallel Architectures
[Figure: diverse parallel architectures, from supercomputers (large-scale multiprocessors) to desktop chip multiprocessors (CMP) and simultaneous multithreading (SMT), each pairing processors with caches]
Intel:
  Montecito (2-core Itanium)
  Kentsfield (4-core P4) …
AMD:
  dual-core Opteron, Athlon X2
  Quad-core Opteron
Sun:
  UltraSparc T1: 32 hardware threads
  UltraSparc T2: 64 hardware threads
[Chip photos: Power 5, dual-core Intel chip, dual-core Cell, Opteron]
Abundant cores in chip multiprocessors – a growing trend.
Chip Multi-Processors (CMP) Historical
Perspective: ca. 2005 vs. Now
IBM: Power 5 (2-core) → Power 9 … 40 cores
Intel: Montecito (2-core Itanium), Kentsfield (4-core P4) … → Xeon Ice Lake: 40 cores, 80 threads
AMD: dual-core Opteron, Athlon X2, Quad-core Opteron → Threadripper/Epyc: 64 cores, 128 threads
Programming in C or C++
Data structures
Basics of machine architecture
Basics of network programming
Parallel vs. Distributed
Programming
Parallel programming has matured:
Few standard programming models
Few common machine architectures
Portability between models and architectures
Bottom Line
To computer scientists: speedup, execution time.
To applications people: size of problem, accuracy of
solution, etc.
Speedup of Algorithm
Speedup of an algorithm = sequential execution time / execution time of the parallel algorithm (measured over a given input set).
If this ratio is n, then we say we have n-fold speedup. For example, if a sequential algorithm requires 10 min of compute time and a corresponding parallel algorithm requires 2 min, we say that there is 5-fold speedup.
[Figure: speedup of the algorithm as a function of the number of processors p]
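Stated compactly (notation mine, not on the slide: T_1 is the sequential time, T_p the time on p processors):

  S(p) = \frac{T_1}{T_p} = \frac{10\ \text{min}}{2\ \text{min}} = 5 \quad \text{(5-fold speedup)}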
Speedup on Problem
Linear speedup
  Confusing term: implicitly means a 1-to-1 speedup per processor.
  (Almost always) as good as you can do.
Sub-linear speedup: more common in practice, due to the overhead of startup, synchronization, communication, etc.
  (Communication overhead is the time and resources spent transferring data and messages between the parallel components, such as processors, cores, nodes, or clusters.)
[Figure: linear vs. actual (sub-linear) speedup as a function of the number of processors p]
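As a rough numeric illustration (numbers assumed, not from the slide): on p = 8 processors, linear speedup would be 8; a measured speedup of 5 is sub-linear, corresponding to a parallel efficiency (the usual S(p)/p measure, not defined on the slide) of

  E(p) = \frac{S(p)}{p} = \frac{5}{8} \approx 0.63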
Scalability
There is no really precise definition.
Roughly speaking, a program is said to scale to a certain
number of processors p, if going from p-1 to p
processors results in some acceptable improvement in
speedup (for instance, an increase of 0.5).
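For instance (hypothetical numbers, not from the slide):

  S(7) = 5.0, \quad S(8) = 5.6, \quad S(8) - S(7) = 0.6 \ge 0.5

so under the criterion above, the program would be said to scale to 8 processors.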
Super-linear Speedup?
On one processor:
  statement1;
  statement2;
On two processors:
  processor1: statement1;
  processor2: statement2;
Fundamental Assumption
Processors execute independently: we have no control over the order in which statements on different processors execute.
Possibility 1
  Processor1 executes statement1 before Processor2 executes statement2.
Possibility 2
  Processor2 executes statement2 before Processor1 executes statement1.
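A minimal sketch of this situation (my illustration using POSIX threads, not from the slides): two threads each execute one of the statements, and nothing constrains which one runs first.

  /* build: cc -pthread demo.c */
  #include <pthread.h>
  #include <stdio.h>

  int a = 0, b = 0;

  /* statement1 and statement2, each on its own "processor" (thread) */
  void *run1(void *arg) { (void)arg; a = 1; return NULL; }
  void *run2(void *arg) { (void)arg; b = 2; return NULL; }

  int main(void) {
      pthread_t p1, p2;
      pthread_create(&p1, NULL, run1, NULL);
      pthread_create(&p2, NULL, run2, NULL);
      pthread_join(p1, NULL);
      pthread_join(p2, NULL);
      /* either thread may have run first; here the final state is the
         same either way because the two statements are independent */
      printf("a=%d b=%d\n", a, b);
      return 0;
  }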
When can 2 statements execute
in parallel?
In other words,
statement1; statement2;
must be equivalent to
statement2; statement1;
Example 1
a = 1;
b = a;
a = f(x);
b = a;
a = 1;
a = 2;
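Annotated with the dependence each pair contains (my annotation; the terms are defined on the following slides):

  a = 1;          /* S1 writes a                                          */
  b = a;          /* S2 reads a: true dependence, cannot be reordered     */

  a = f(x);       /* same pattern: b = a reads the value f(x) produced    */
  b = a;

  a = 1;          /* both statements write a: output dependence,          */
  a = 2;          /* the final value of a depends on the order            */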
True Dependence

Statements S1, S2.
S2 has a true dependence on S1
iff
S2 reads a value written by S1.

for (j = 1; j < n; j++)
    S1: a[j] = a[j-1];

A true dependence occurs when a location in memory is written to before it is read. It is also known as a flow dependency or data dependency, and occurs when an instruction depends on the result of a previous instruction. A violation of a true dependency leads to a read-after-write (RAW) hazard.
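A small sketch (my illustration, not from the slide) of why the loop above cannot be parallelized across iterations: unrolling the first few iterations makes the RAW chain explicit.

  /* unrolled iterations of: for (j = 1; j < n; j++) a[j] = a[j-1]; */
  a[1] = a[0];   /* writes a[1]                                             */
  a[2] = a[1];   /* reads a[1] just written: RAW dependence on iteration 1  */
  a[3] = a[2];   /* reads a[2]: each iteration must wait for the one before */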
Anti-dependence

Statements S1, S2.
S2 has an anti-dependence on S1
iff
S2 writes a value read by S1.

for (j = 0; j < n; j++)
    S1: b[j] = b[j+1];

An anti-dependence occurs when a location in memory is read before that same location is written to: an instruction requires a value that is later updated. A violation of an anti-dependency leads to a write-after-read (WAR) hazard.

1. B = 3
2. A = B + 1
3. B = 7

In this example, instruction 2 anti-depends on instruction 3: B is first read (instruction 2) and then written (instruction 3), so the ordering of these instructions cannot be changed. In the loop above, iteration j reads b[j+1], which iteration j+1 writes, giving a loop-carried anti-dependence.
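One way to see that anti-dependences are only name dependences (a sketch I am adding, not from the slide): copy the old values first, so every iteration reads only the copy and the loop-carried WAR disappears.

  /* hypothetical transformation of: for (j = 0; j < n; j++) b[j] = b[j+1]; */
  for (j = 0; j < n; j++)
      old[j] = b[j+1];      /* snapshot the values that were about to be overwritten */
  for (j = 0; j < n; j++)
      b[j] = old[j];        /* iterations are now independent and can run in parallel */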
Output Dependence

Statements S1, S2.
S2 has an output dependence on S1
iff
S2 writes a variable written by S1.

for (j = 0; j < n; j++)
{
    S1: c[j] = j;
    S2: c[j+1] = 5;
}

An output dependence occurs when a location in memory is written to before that same location is written to again in another statement: the ordering of the instructions affects the final output value of a variable. A violation of an output dependency leads to a write-after-write (WAW) hazard.

1. B = 3
2. A = B + 1
3. B = 7

In this example, there is an output dependence between instructions 3 and 1; changing the ordering of these instructions will change the final value of A. As with anti-dependences, output dependences are name dependences: they may be removed through renaming of variables.
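A sketch of removing the output dependence by renaming (my illustration, not from the slide): give the second write to B its own name, so the two writes no longer target the same location.

  B1 = 3;        /* was: B = 3                                                 */
  A  = B1 + 1;   /* still reads the value written by the first statement       */
  B2 = 7;        /* was: B = 7; renamed, so the WAW on B is gone               */
  /* the final value of B is B2; statements 1 and 3 may now be reordered */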
When can 2 statements execute
in parallel?
Two statements can execute in parallel iff there are no dependences between them.
Likewise, loop iterations with no dependences between them can be executed in parallel.
Example 5
for(i=0;i<100;i++) a[i] = i;
for(i=0;i<100;i++) b[i] = 2*i;
(These notes describe a loop nest over i and j; a sketch of such a nest follows below.)
Loop-independent dependence on i.
Loop-carried dependence on j.
Outer loop can be parallelized, inner loop cannot.
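A minimal sketch (hypothetical code, not from the slides) of a loop nest with exactly that dependence structure: the rows are independent of each other, but within a row each element depends on the previous one.

  /* hypothetical loop nest: outer i loop parallel, inner j loop sequential */
  for (i = 0; i < 100; i++)          /* no dependence between different i values   */
      for (j = 1; j < 100; j++)
          a[i][j] = a[i][j-1] + 1;   /* loop-carried (RAW) dependence on j: in order */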
Example 10
printf("a");
printf("b");
a = f(x);
b = g(x);
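My reading of these two pairs (annotation, not from the slide): the printf calls both write to stdout, so reordering them changes the observable output; the two assignments can run in parallel only under an assumption about f and g.

  printf("a");            /* both statements write to the same stream (stdout):  */
  printf("b");            /* swapping them changes the output, so order matters  */

  a = f(x);               /* independent only if f and g have no side effects    */
  b = g(x);               /* and do not read or write any shared data            */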