Introduction to Parallel and Distributed Computing
Plan
3. Concurrency platforms
The CPU-Memory GAP
Hierarchical memory
From the upper (smaller, faster) levels to the lower (larger, cheaper) levels:
∎ CPU registers: 100s of bytes, 300–500 ps (0.3–0.5 ns) access time; data is staged as instruction operands of 1–8 bytes, managed by the program/compiler
∎ L1 and L2 caches: 10s–100s of KB, ~1 ns to ~10 ns, ~$1000s/GB; data is staged in cache blocks of 32–64 bytes, managed by the cache controller
∎ Main memory: GBs, 80–200 ns, ~$100/GB; data moves between cache and memory in blocks of 64–128 bytes
∎ Disk: 10s of TB, ~10 ms (10,000,000 ns), ~$1/GB; data moves between memory and disk in pages of 4–8 KB, managed by the OS
∎ Tape: effectively unlimited capacity, seconds to minutes access time, ~$1/GB; data moves between disk and tape as files of MBs, managed by the user/operator
Going down the hierarchy, each level is larger but slower and cheaper per byte than the one above it (a small locality experiment is sketched below).
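The practical consequence of the hierarchy is that programs run fastest when they reuse data that is already near the top of it. Below is a minimal C++ sketch of a locality experiment (illustrative, not from the slides; the matrix size is an arbitrary assumption): the same sum is computed by traversing a row-major matrix in row order (stride-1, cache-friendly) and in column order (large stride, cache-unfriendly).

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 4096;                 // 4096 x 4096 doubles (~128 MB)
    std::vector<double> a(n * n, 1.0);

    auto time_sum = [&](bool row_major) {
        auto start = std::chrono::steady_clock::now();
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                s += row_major ? a[i * n + j]   // stride-1: walks memory sequentially
                               : a[j * n + i];  // stride-n: jumps a full row per access
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s traversal: sum = %.0f, %lld ms\n",
                    row_major ? "row-major" : "column-major", s, (long long)ms);
    };

    time_sum(true);    // cache-friendly: typically several times faster
    time_sum(false);   // cache-unfriendly
}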
Moore’s law
The Pentium Family: Do not rewrite software, just buy a new machine!
https://en.wikipedia.org/wiki/Moore%27s_law
From Moore’s law to multicore processors
Multicore processors
∎ In the 1st Gen. Intel Core i7, each core had an L1 data cache and an
L1 instruction cache, together with a unified L2 cache
∎ The cores share an L3 cache
∎ Note the sizes of the successive caches
Graphics processing units (GPUs)
∎ In a GPU, the small local memories have much smaller access time
than the large shared memory.
∎ Thus, as much as possible, cores should access data in the local memories,
while the shared memory is essentially used for data exchange between SMs.
Distributed Memory
Hybrid Distributed-Shared Memory
∎ The largest and fastest computers in the world today employ both
shared and distributed memory architectures.
∎ Current trends seem to indicate that this type of memory architecture
will continue to prevail.
∎ While this model allows for applications to scale, it increases the
complexity of writing computer programs.
Divide-and-Conquer
Divide-and-Conquer and Fork-Join
∎ Recursively applying fork-join can “easily” parallelize a divide-and-conquer
algorithm, as sketched below
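As a minimal illustration of fork-join (an illustrative sketch, not the course's Cilk code; it uses std::async from the C++ standard library and an arbitrary serial cutoff): a divide-and-conquer sum where the left half is forked as a task, the right half is computed by the caller, and the results are combined after the join.

#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

long long parallel_sum(const long long* a, std::size_t n) {
    if (n <= (1u << 16))                             // small case: solve serially
        return std::accumulate(a, a + n, 0LL);
    std::size_t half = n / 2;
    auto left = std::async(std::launch::async,       // fork: left half runs as a new task
                           parallel_sum, a, half);
    long long right = parallel_sum(a + half, n - half);
    return left.get() + right;                       // join, then combine
}

int main() {
    std::vector<long long> v(1 << 20, 1);
    std::printf("%lld\n", parallel_sum(v.data(), v.size()));   // prints 1048576
}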
Map
∎ Simultaneously execute a function on each data item in a collection
∎ If more data items than threads, apply the pattern block-wise:
(1) partition the collection, and (2) apply one thread to each part
∎ This pattern is often simplified as just a parallel_for loop
∎ Where multiple map steps are performed in a row,
they may operate in lockstep
[Figure: each input data item is processed by an independent function execution, producing one output item per input.]
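A minimal C++ sketch of the block-wise map described above (illustrative; parallel_map, the chunk size and the thread count are assumptions, not part of the slides): the collection is partitioned into one contiguous part per thread, and each thread applies the function to every element of its part.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

template <class T, class F>
void parallel_map(std::vector<T>& data, F f, unsigned nthreads) {
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(data.size(), lo + chunk);
        if (lo >= hi) break;
        workers.emplace_back([&data, f, lo, hi] {    // one thread per part
            for (std::size_t i = lo; i < hi; ++i)
                data[i] = f(data[i]);                // independent element-wise update
        });
    }
    for (auto& w : workers) w.join();
}

int main() {
    std::vector<double> v(1000000, 2.0);
    parallel_map(v, [](double x) { return std::sqrt(x); }, 4);
    std::printf("%f\n", v[0]);                       // 1.414214
}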
Workpile
∎ Workpile generalizes the map pattern to a queue of tasks
∎ Tasks in flight can add new tasks to the input queue
∎ Threads take tasks from the queue until it is empty
∎ Can be seen as a parallel_while loop (a minimal sketch follows below)
[Figure: tasks flow from an input queue through function executions to the output; executing tasks may add new tasks to the input queue.]
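A minimal C++ sketch of the workpile (illustrative; the Workpile class and its task type are assumptions, not from the slides): worker threads repeatedly pop tasks from a shared queue, a running task may push new tasks, and the pool stops once the queue is empty and no task is still in flight.

#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class Workpile {
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    int in_flight = 0;                                   // tasks currently executing
public:
    void push(std::function<void()> t) {
        { std::lock_guard<std::mutex> lk(m); tasks.push(std::move(t)); }
        cv.notify_one();
    }
    void run(int nthreads) {                              // drain the pile with nthreads workers
        auto worker = [this] {
            std::unique_lock<std::mutex> lk(m);
            for (;;) {
                cv.wait(lk, [this] { return !tasks.empty() || in_flight == 0; });
                if (tasks.empty()) { cv.notify_all(); return; }   // nothing left to do
                auto t = std::move(tasks.front()); tasks.pop(); ++in_flight;
                lk.unlock();
                t();                                      // may call push() and add new work
                lk.lock();
                --in_flight;
                if (tasks.empty() && in_flight == 0) cv.notify_all();
            }
        };
        std::vector<std::thread> pool;
        for (int i = 0; i < nthreads; ++i) pool.emplace_back(worker);
        for (auto& th : pool) th.join();
    }
};

int main() {
    Workpile wp;
    // Illustrative tasks: each task for n > 0 pushes a follow-up task for n - 1,
    // so new work is discovered while the pile is being processed.
    std::function<void(int)> job = [&](int n) {
        std::printf("processing task %d\n", n);
        if (n > 0) wp.push([&, n] { job(n - 1); });
    };
    wp.push([&] { job(3); });
    wp.run(4);                                            // 4 worker threads drain the pile
}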
Reduction
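The reduction pattern combines all items of a collection into a single result using an associative operation such as a sum or a maximum. A minimal C++17 sketch (illustrative, not from the slides; depending on the toolchain, the parallel execution policy may need an extra library such as TBB):

#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> v(1000000, 0.5);
    // std::reduce may combine the elements in any order and grouping,
    // which is why the operation must be associative (here: addition).
    double total = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
    std::printf("total = %f\n", total);      // 500000.000000
}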
Producer-Consumer
∎ The consumer processes data items, pulling them from the queue
[Figure: a producer thread pushes data items onto a data queue; a consumer thread pulls them off and processes them. A minimal sketch follows below.]
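A minimal C++ sketch of a producer-consumer pair over a bounded queue (illustrative; the queue capacity and the -1 sentinel are assumptions, not from the slides): the producer blocks while the queue is full, the consumer blocks while it is empty, and a sentinel value tells the consumer to stop.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> q;
std::mutex m;
std::condition_variable not_empty, not_full;
const std::size_t capacity = 8;

void produce(int n) {
    for (int i = 0; i <= n; ++i) {
        int item = (i < n) ? i : -1;                  // -1 is the "done" sentinel
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [] { return q.size() < capacity; });
        q.push(item);
        not_empty.notify_one();
    }
}

void consume() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        not_empty.wait(lk, [] { return !q.empty(); });
        int item = q.front(); q.pop();
        not_full.notify_one();
        lk.unlock();
        if (item == -1) return;                       // sentinel: no more items
        std::printf("consumed %d\n", item);           // "process" the item
    }
}

int main() {
    std::thread p(produce, 100), c(consume);
    p.join(); c.join();
}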
Pipeline
∎ A sequence of stages where the output of one stage is used as the
input to another
∎ Example: in a pipelined processor, instructions flow through the
central processing unit (CPU) in stages (Instruction Fetch, Decode,
Execute, etc.)
∎ Two consecutive stages form a producer-consumer pair
∎ Internal stages are both producer and consumer
∎ Typically, a pipeline is constructed statically through code
organization
∎ Pipelines can be created dynamically and implicitly with
AsyncGenerators and the call-stack
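A minimal C++ sketch of a three-stage pipeline (illustrative; the Channel type and the -1 sentinel are assumptions, not from the slides): stage 1 generates numbers, stage 2 squares them, stage 3 prints them. Each adjacent pair of stages is a producer-consumer pair connected by a small thread-safe queue, and the sentinel is forwarded downstream to shut the stages down in order.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

struct Channel {                                   // minimal thread-safe FIFO
    std::queue<long> q;
    std::mutex m;
    std::condition_variable cv;
    void put(long x) {
        { std::lock_guard<std::mutex> lk(m); q.push(x); }
        cv.notify_one();
    }
    long get() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        long x = q.front(); q.pop();
        return x;
    }
};

int main() {
    Channel a, b;
    std::thread generate([&] {                     // stage 1: produce 1..10
        for (long i = 1; i <= 10; ++i) a.put(i);
        a.put(-1);                                 // end-of-stream sentinel
    });
    std::thread square([&] {                       // stage 2: transform
        for (long x; (x = a.get()) != -1; ) b.put(x * x);
        b.put(-1);                                 // forward the sentinel
    });
    std::thread print([&] {                        // stage 3: consume
        for (long x; (x = b.get()) != -1; ) std::printf("%ld\n", x);
    });
    generate.join(); square.join(); print.join();
}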
Pascal triangle construction: a stencil computation
[Figure: the entries of Pascal's triangle arranged on a square grid (rows 1 1 1 1 ..., 1 2 3 4 ..., 1 3 6 10 ..., 1 4 10 20 ..., ...); every inner entry is the sum of its north and west neighbours, which makes the construction a stencil computation. A small serial sketch follows below.]
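A small serial C++ sketch of the stencil (illustrative; the 8 x 8 grid size is an assumption): the first row and first column are set to 1 and every other entry is the sum of its north and west neighbours, reproducing the rows of the figure.

#include <cstdio>
#include <vector>

int main() {
    const int n = 8;
    std::vector<std::vector<long long>> T(n, std::vector<long long>(n, 1));
    for (int i = 1; i < n; ++i)
        for (int j = 1; j < n; ++j)
            T[i][j] = T[i - 1][j] + T[i][j - 1];   // north + west stencil
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) std::printf("%4lld ", T[i][j]);
        std::printf("\n");                         // rows 1 1 1 ..., 1 2 3 ..., 1 3 6 ...
    }
}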
Divide and conquer: principle
[Figure: divide-and-conquer on the same grid: the region is split into four quadrants; the north-west quadrant (I) is computed first, the north-east and south-west quadrants (II) can then be computed in parallel, and the south-east quadrant (III) is computed last; each quadrant is handled recursively in the same way.]
Blocking strategy: principle
[Figure: the grid (rows a0, ..., a7) is partitioned into square blocks; each block depends only on the blocks above it and to its left, so the blocks on the same anti-diagonal (labelled 1, 2, 3, 4) are independent and are computed in parallel, wave after wave. A sketch follows below.]
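A C++ sketch of the blocking strategy (illustrative; the grid size, block size and the modulus used to avoid integer overflow are assumptions, not from the slides): the grid is cut into square blocks, and the blocks on each anti-diagonal are processed in parallel, one wave after another, since a block depends only on blocks above it and to its left.

#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

const int n = 512, B = 128, nb = n / B;              // grid size, block size, blocks per side
std::vector<std::vector<long long>> T(n, std::vector<long long>(n, 1));

void do_block(int bi, int bj) {                      // fill one B x B block
    for (int i = std::max(1, bi * B); i < (bi + 1) * B; ++i)
        for (int j = std::max(1, bj * B); j < (bj + 1) * B; ++j)
            T[i][j] = (T[i - 1][j] + T[i][j - 1]) % 1000000007;  // north + west, kept modulo a prime
}

int main() {
    for (int wave = 0; wave <= 2 * (nb - 1); ++wave) {       // anti-diagonals of blocks
        std::vector<std::thread> workers;
        for (int bi = 0; bi < nb; ++bi) {
            int bj = wave - bi;
            if (bj >= 0 && bj < nb)
                workers.emplace_back(do_block, bi, bj);      // blocks of a wave run in parallel
        }
        for (auto& w : workers) w.join();                    // wait for the wave to finish
    }
    std::printf("T[%d][%d] mod p = %lld\n", n - 1, n - 1, T[n - 1][n - 1]);
}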
Outline
3. Concurrency platforms
Programming patterns in Julia
[Julia REPL banner: Version 1.7.1 (2021-12-22); documentation at https://docs.julialang.org]
Julia
using Distributed   # provides nprocs, myid, remotecall_fetch in Julia 1.x

function pmap(f, lst)
    np = nprocs()                # the number of processes available
    n = length(lst)
    results = Vector{Any}(undef, n)
    i = 1
    # function to produce the next work item from the queue
    nextidx() = (idx = i; i += 1; idx)
    @sync begin
        for p = 1:np
            if p != myid() || np == 1
                @async begin
                    while true
                        idx = nextidx()
                        if idx > n
                            break
                        end
                        results[idx] = remotecall_fetch(f, p, lst[idx])
                    end
                end
            end
        end
    end
    results
end
Fork-Join with Cilk
#include <cilk/cilk.h>

int fib(int n)
{
  if (n < 2) return n;
  int x, y;
  x = cilk_spawn fib(n-1);   // the spawned call may run in parallel
  y = fib(n-2);              // with this continuation
  cilk_sync;                 // wait for the spawned call before using x
  return x+y;
}
Cilk
∎ Cilk has been developed since 1994 at the MIT Laboratory for
Computer Science by Prof. Charles E. Leiserson and his group, in
particular by Matteo Frigo and Tao B. Schardl
∎ Cilk is a multithreaded language for parallel programming that
generalizes the semantics of C (resp. C++) by introducing linguistic
constructs for parallel control.
∎ Cilk is a faithful extension of C (resp. C++). That is, the C (resp.
C++) elision of a Cilk program is a correct implementation of the
semantics of that program.
∎ Cilk’s scheduler maps strands onto processors dynamically at runtime, using
the work-stealing principle. Under reasonable assumptions, this provides a
performance guarantee.
∎ Cilk has supporting tools for detecting data races (and thus
non-deterministic behaviour) and for performance analysis.
Heterogeneous programming with CUDA
∎ The parallel code is written for a thread
↪ Each thread is free to execute a unique code path
↪ Built-in thread and block ID variables are used to map each thread
to a specific data tile (see next slide).
∎ Thus, each thread executes the same code.
∎ However, different threads work on different data, based on their
thread and block IDs.
CUDA Example: increment array elements (1/2)
CUDA Example: increment array elements (2/2)
References
[1] M. McCool, J. Reinders, and A. Robison. Structured Parallel Programming:
Patterns for Efficient Computation. Elsevier, 2012.
[2] J. E. Savage. Models of Computation: Exploring the Power of Computing.
Addison-Wesley, 1998. ISBN: 978-0-201-89539-1.
[3] M. L. Scott. Programming Language Pragmatics (3rd ed.). Academic Press,
2009. ISBN: 978-0-12-374514-9.
[4] A. Williams. C++ Concurrency in Action: Practical Multithreading (1st ed.).
Shelter Island, NY: Manning Publ., 2012. URL: https://cds.cern.ch/record/1483005.