Introduction to parallel and distributed computing

Marc Moreno Maza

Ontario Research Center for Computer Algebra


Departments of Computer Science and Mathematics
University of Western Ontario, Canada

CS4402 - CS9635, January 30, 2024

Plan

1. Hardware architecture and concurrency

2. Parallel programming patterns

3. Concurrency platforms

1. Hardware architecture and concurrency

The CPU-memory gap

∎ In the 1980s, a memory access and a CPU operation took roughly the same time.
∎ CPU frequency increases between 1985 and 2005 reduced CPU operation times much more than DRAM technology improvements could reduce memory access times.
∎ Even after the introduction of multicore processors, the gap is still huge.

Hierarchical memory

From the upper level (faster, smaller, more expensive) to the lower level (slower, larger, cheaper):

Level           Capacity             Access time               Cost             Staging (managed by)   Transfer unit
Registers       100s of bytes        300-500 ps (0.3-0.5 ns)   -                prog./compiler         instruction operands, 1-8 bytes
L1 / L2 cache   10s-100s of KBytes   ~1 ns - ~10 ns            ~$1000s/GByte    cache controller       cache blocks, 32-64 bytes (L1), 64-128 bytes (L2)
Main memory     GBytes               80 ns - 200 ns            ~$100/GByte      operating system       pages, 4K-8K bytes
Disk            10s of TBytes        10 ms (10,000,000 ns)     ~$1/GByte        user/operator          files, MBytes
Tape            infinite             seconds to minutes        ~$1/GByte        -                      -

∎ Data moves in blocks (cache-lines, pages) between levels
∎ On the right, note the block sizes
∎ On the left, note the access times, sizes and prices.

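To make the cost of crossing these levels concrete, the following is a minimal sketch (not taken from the slides; the 4096 x 4096 matrix and all sizes are illustrative) that sums the same matrix twice: row by row, so that each cache line brought from memory is fully used, and column by column, so that almost every access touches a new line. On a typical machine the second traversal is several times slower.

// Minimal sketch (not from the slides): the effect of cache-line blocks on a
// simple matrix sum.  Row-major traversal touches consecutive addresses, so
// each cache line fetched from main memory is fully used; column-major
// traversal touches a new line on (almost) every access.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;
    std::vector<double> a(static_cast<size_t>(n) * n, 1.0);

    auto time_sum = [&](bool row_major) {
        double s = 0.0;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                s += row_major ? a[static_cast<size_t>(i) * n + j]   // stride 1
                               : a[static_cast<size_t>(j) * n + i];  // stride n
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s traversal: sum = %.0f, %lld ms\n",
                    row_major ? "row-major   " : "column-major", s,
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    };

    time_sum(true);   // cache-friendly
    time_sum(false);  // cache-unfriendly: typically several times slower
    return 0;
}
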
Moore’s law

The Pentium Family: Do not rewrite software, just buy a new machine!

https://en.wikipedia.org/wiki/Moore%27s_law
From Moore’s law to multicore processors

Image taken from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th ed., 2010.

Multicore processors

∎ In the 1st Gen. Intel Core i7, each core had an L1 data cache and an L1 instruction cache, together with a unified L2 cache
∎ The cores share an L3 cache
∎ Note the sizes of the successive caches

Graphics processing units (GPUs)

∎ A GPU consists of a scheduler, a large shared memory and several streaming multiprocessors (SMs)
∎ In addition, each SM has a local (private) small memory.
∎ In a GPU, the small local memories have much smaller access times than the large shared memory.
∎ Thus, as much as possible, cores should access data in the local memories, while the shared memory should essentially be used for data exchange between SMs.

Distributed Memory

∎ Distributed memory systems require a communication network to connect inter-processor memory.
∎ Processors have their own local memory and operate independently.
∎ Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors.
∎ Data exchange between processors is managed by the programmer, not by the hardware.

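The slides do not name a particular programming interface for this explicit data exchange; MPI is used below purely as an illustration (an assumption, not part of the lecture). The sketch sends one integer from the local memory of process 0 to the local memory of process 1.

// Minimal sketch (assumption: MPI, which the slides do not prescribe) of
// programmer-managed data exchange between two processes with separate
// address spaces.  Build with an MPI compiler wrapper, e.g. mpic++,
// and run with e.g. mpirun -np 2 ./a.out
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;  // lives only in process 0's local memory
        MPI_Send(&value, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Process 1 cannot read process 0's memory; the data must be sent.
        MPI_Recv(&value, 1, MPI_INT, /*src=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        std::printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
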
Hybrid Distributed-Shared Memory

∎ The largest and fastest computers in the world today employ both shared and distributed memory architectures.
∎ Current trends seem to indicate that this type of memory architecture will continue to prevail.
∎ While this model allows for applications to scale, it increases the complexity of writing computer programs.

2. Parallel programming patterns

Divide-and-Conquer and Fork-Join

∎ Fork: divide the problem and execute the separate calls in parallel
∎ Join: merge the parallel execution back into serial execution
∎ Recursively applying fork-join can “easily” parallelize a divide-and-conquer algorithm

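As a sketch of the pattern (not from the slides; the Cilk formulation of fork-join appears in the concurrency-platforms section), a divide-and-conquer sum can be expressed with C++ std::async: the fork launches the left half asynchronously, the right half runs in the caller, and the join is the future's get().

// Minimal sketch (not from the slides): fork-join parallel sum of an array.
// The cutoff of 100000 elements for the serial base case is illustrative.
#include <future>
#include <numeric>
#include <vector>
#include <cstdio>

long long sum(const std::vector<int>& a, size_t lo, size_t hi) {
    if (hi - lo < 100000)                        // small case: run serially
        return std::accumulate(a.begin() + lo, a.begin() + hi, 0LL);
    size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,   // fork
                           sum, std::cref(a), lo, mid);
    long long right = sum(a, mid, hi);           // continue in the parent
    return left.get() + right;                   // join
}

int main() {
    std::vector<int> a(1 << 22, 1);
    std::printf("sum = %lld\n", sum(a, 0, a.size()));
    return 0;
}
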
Map

∎ Simultaneously execute a function on each data item in a collection
∎ If there are more data items than threads, apply the pattern block-wise:
  (1) partition the collection, and (2) apply one thread to each part
∎ This pattern is often simplified as just a parallel_for loop
∎ Where multiple map steps are performed in a row, they may operate in lockstep

[Figure: each data item of the input collection is handed to one function execution, producing one item of the output collection.]

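A minimal sketch of the block-wise map (not from the slides; the function f, the data size and the chunking are illustrative): the collection is partitioned and one thread applies the function to each part.

// Minimal sketch (not from the slides) of the map pattern applied block-wise.
#include <algorithm>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    std::vector<int> data(1'000'000, 3);
    auto f = [](int x) { return 2 * x; };        // the mapped function

    unsigned p = std::max(1u, std::thread::hardware_concurrency());
    size_t chunk = (data.size() + p - 1) / p;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < p; ++t) {
        size_t lo = t * chunk;
        size_t hi = std::min(data.size(), lo + chunk);
        workers.emplace_back([&, lo, hi] {       // one thread per block
            for (size_t i = lo; i < hi; ++i)
                data[i] = f(data[i]);            // independent updates: no locking needed
        });
    }
    for (auto& w : workers) w.join();

    std::printf("data[0] = %d\n", data[0]);      // prints 6
    return 0;
}
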
Workpile

∎ Workpile generalizes the map pattern to a queue of tasks
∎ Tasks in flight can add new tasks to the input queue
∎ Threads take tasks from the queue until it is empty
∎ Can be seen as a parallel_while loop

[Figure: worker threads repeatedly pull tasks from a shared input queue, execute them, and may push newly created tasks back onto the queue.]

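A minimal sketch of a workpile (not from the slides; the "task" here is just an integer n that spawns two smaller tasks): worker threads pop tasks from a shared queue, and a counter of unfinished tasks tells them when the pile is truly exhausted.

// Minimal sketch (not from the slides) of the workpile pattern.
#include <atomic>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main() {
    std::queue<int> pile;
    std::mutex m;
    std::atomic<int> in_flight{0};     // tasks queued or currently executing
    std::atomic<long> executed{0};

    auto push = [&](int n) {
        std::lock_guard<std::mutex> g(m);
        pile.push(n);
        ++in_flight;
    };
    push(20);                          // seed the workpile with one task

    auto worker = [&] {
        for (;;) {
            int n = 0;
            bool have = false;
            {
                std::lock_guard<std::mutex> g(m);
                if (!pile.empty()) { n = pile.front(); pile.pop(); have = true; }
            }
            if (!have) {
                if (in_flight.load() == 0) return;   // no queued or running work: done
                std::this_thread::yield();           // running tasks may still add work
                continue;
            }
            ++executed;                              // "execute" the task
            if (n > 1) { push(n - 1); push(n - 2); } // a task may add new tasks
            --in_flight;                             // this task is finished
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();

    std::printf("executed %ld tasks\n", executed.load());
    return 0;
}
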
Reduction

∎ A reduction combines every element in a collection into one element, using an associative operator.
∎ Example: computing the sum (or product) of 𝑛 matrices.
∎ Grouping the operations is often needed to allow for parallelism.
∎ This grouping requires associativity, but not commutativity.

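A minimal sketch of a blocked reduction (not from the slides): each thread reduces one contiguous part of the collection, and the partial results are then combined in left-to-right order, so only associativity of the operator is required.

// Minimal sketch (not from the slides): each thread reduces one block, then
// the partial results are combined without changing the order of the elements.
#include <numeric>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    std::vector<long long> a(1 << 20);
    std::iota(a.begin(), a.end(), 1);                 // 1, 2, 3, ...

    const unsigned p = 4;                             // number of parts/threads
    const size_t chunk = a.size() / p;
    std::vector<long long> partial(p, 0);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < p; ++t) {
        size_t lo = t * chunk;
        size_t hi = (t + 1 == p) ? a.size() : lo + chunk;
        workers.emplace_back([&, t, lo, hi] {
            partial[t] = std::accumulate(a.begin() + lo, a.begin() + hi, 0LL);
        });
    }
    for (auto& w : workers) w.join();

    // Combining the partials respects the original block order, so a
    // non-commutative (but associative) operator would also work here.
    long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
    std::printf("total = %lld\n", total);             // n(n+1)/2 for n = 2^20
    return 0;
}
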
Producer-Consumer

∎ Two functions connected by a queue
∎ The producer produces data items, pushing them to the queue
∎ The consumer processes data items, pulling them from the queue
∎ Producer and consumer execute simultaneously; at least one must be active at all times ⟹ no deadlock
∎ In some circumstances, the producer may be considered as an iterator or generator

[Figure: the producer pushes data items onto one end of the queue; the consumer pulls them from the other end.]

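A minimal sketch of the pattern (not from the slides): the queue is protected by a mutex, and a condition variable lets the consumer sleep while the queue is empty; the value -1 is used here as an end-of-stream marker.

// Minimal sketch (not from the slides) of the producer-consumer pattern.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <cstdio>

int main() {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;

    std::thread producer([&] {
        for (int i = 1; i <= 10; ++i) {                    // produce data items
            { std::lock_guard<std::mutex> g(m); q.push(i); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> g(m); q.push(-1); }  // end-of-stream marker
        cv.notify_one();
    });

    std::thread consumer([&] {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty(); });       // sleep until an item arrives
            int item = q.front();
            q.pop();
            lk.unlock();
            if (item == -1) break;                         // producer is done
            std::printf("consumed %d\n", item);            // process the data item
        }
    });

    producer.join();
    consumer.join();
    return 0;
}
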
Pipeline

∎ A sequence of stages where the output of one stage is used as the input to another
∎ Example: in a pipelined processor, instructions flow through the central processing unit (CPU) in stages (Instruction Fetch, Decode, Execute, etc.)
∎ Two consecutive stages form a producer-consumer pair
∎ Internal stages are both producer and consumer
∎ Typically, a pipeline is constructed statically through code organization
∎ Pipelines can be created dynamically and implicitly with AsyncGenerators and the call-stack

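A minimal sketch of a three-stage pipeline (not from the slides): each pair of adjacent stages is connected by a small thread-safe queue, i.e. forms a producer-consumer pair, and the middle stage is both consumer and producer.

// Minimal sketch (not from the slides) of a three-stage pipeline.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <cstdio>

struct Channel {                       // unbounded queue; -1 means "end of stream"
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    void send(int v) { { std::lock_guard<std::mutex> g(m); q.push(v); } cv.notify_one(); }
    int receive() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty(); });
        int v = q.front(); q.pop(); return v;
    }
};

int main() {
    Channel c1, c2;

    std::thread generate([&] {                    // stage 1: produce 1..5
        for (int i = 1; i <= 5; ++i) c1.send(i);
        c1.send(-1);
    });
    std::thread square([&] {                      // stage 2: both consumer and producer
        for (int v; (v = c1.receive()) != -1; ) c2.send(v * v);
        c2.send(-1);
    });
    std::thread print([&] {                       // stage 3: consume
        for (int v; (v = c2.receive()) != -1; ) std::printf("%d\n", v);
    });

    generate.join(); square.join(); print.join();
    return 0;
}
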
Pascal triangle construction: a stencil computation

0  0  0  0  0  0  0  0
1  1  1  1  1  1  1  1  1
1  2  3  4  5  6  7  8
1  3  6 10 15 21 28
1  4 10 20 35 56
1  5 15 35 70
1  6 21 56
1  7 28
1  8

∎ Stencil computations are a class of data processing techniques which update array elements according to a pattern
∎ Construction of the Pascal triangle: nearly the simplest stencil computation!

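A minimal serial sketch of this stencil (not the slides' code; it uses the common formulation in which the first row and first column are 1s): every other entry is the sum of the entry above and the entry to its left.

// Minimal sketch (not from the slides) of the Pascal triangle laid out as a
// square: t[i][j] = t[i-1][j] + t[i][j-1], with 1s on the boundary -- the stencil.
#include <cstdio>

int main() {
    const int n = 8;
    long long t[n][n];

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            t[i][j] = (i == 0 || j == 0) ? 1                          // boundary
                                         : t[i - 1][j] + t[i][j - 1]; // stencil

    for (int i = 0; i < n; ++i) {                 // print the triangular part
        for (int j = 0; j < n - i; ++j) std::printf("%4lld", t[i][j]);
        std::printf("\n");
    }
    return 0;
}
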
Divide and conquer: principle

[Figure: the triangular region and the square region are decomposed into sub-regions labelled I, II and III; sub-regions carrying the same label can be computed concurrently.]

∎ Each triangle region can be computed as a square region followed by two (concurrent) triangle regions.
∎ Each square region can also be computed in a divide and conquer manner.

Blocking strategy: principle

[Figure: the input elements a0, ..., a7 are grouped into blocks; the numbers 1-4 mark successive waves of blocks, computed one wave after another.]

∎ Let 𝐵 be the order of a block and 𝑛 be the number of elements.
∎ Each block is processed serially (as a task) and the set of all blocks is computed concurrently.

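A sketch of the blocking strategy (not the slides' code, and simplified to the full square layout rather than the triangular region): the grid is cut into B x B blocks, each block is filled serially, and all blocks on the same anti-diagonal wave are independent, so they are processed concurrently, wave after wave.

// Minimal sketch (an illustration, not the slides' code) of blocked, wavefront
// processing of the square Pascal layout.  Entries far from the corner wrap
// around the unsigned range; only the block/wave schedule matters here.
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int n = 1024, B = 256, nb = n / B;   // n elements per side, B = block order
    std::vector<unsigned long long> t(static_cast<size_t>(n) * n);

    auto fill_block = [&](int bi, int bj) {    // serial task: fill one B x B block
        for (int i = bi * B; i < (bi + 1) * B; ++i)
            for (int j = bj * B; j < (bj + 1) * B; ++j)
                t[static_cast<size_t>(i) * n + j] =
                    (i == 0 || j == 0) ? 1ULL
                                       : t[static_cast<size_t>(i - 1) * n + j] +
                                         t[static_cast<size_t>(i) * n + j - 1];
    };

    // Blocks on anti-diagonal w (bi + bj == w) depend only on earlier waves,
    // so each wave runs concurrently, one thread per block.
    for (int w = 0; w <= 2 * (nb - 1); ++w) {
        std::vector<std::thread> wave;
        for (int bi = 0; bi < nb; ++bi) {
            int bj = w - bi;
            if (bj >= 0 && bj < nb) wave.emplace_back(fill_block, bi, bj);
        }
        for (auto& th : wave) th.join();       // wait before the next wave
    }

    std::printf("t[1][7] = %llu\n", t[static_cast<size_t>(1) * n + 7]);  // 8
    return 0;
}
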
3. Concurrency platforms

Programming patterns in Julia

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.1 (2021-12-22)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> map(x -> x * 2, [1, 2, 3])
3-element Vector{Int64}:
 2
 4
 6

julia> mapreduce(x->x^2, +, [1:3;])
14

Julia

using Distributed   # provides nprocs, myid, remotecall_fetch in Julia 1.x

function pmap(f, lst)
    np = nprocs()                     # the number of processes available
    n = length(lst)
    results = Vector{Any}(undef, n)
    i = 1
    # function to produce the next work item from the queue
    nextidx() = (idx = i; i += 1; idx)
    @sync begin
        for p = 1:np
            if p != myid() || np == 1
                @async begin
                    while true
                        idx = nextidx()
                        if idx > n
                            break
                        end
                        results[idx] = remotecall_fetch(f, p, lst[idx])
                    end
                end
            end
        end
    end
    results
end

Fork-Join with Cilk

int fib(int n)
{
    if (n < 2) return n;
    int x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return x+y;
}

∎ The named child function cilk_spawn fib(n-1) may execute in


parallel with its parent
∎ Cilk keywords cilk_spawn and cilk_sync grant permissions for
parallel execution. They do not command parallel execution.
∎ Visit https://www.opencilk.org/

Cilk

∎ Cilk has been developed since 1994 at the MIT Laboratory for Computer Science by Prof. Charles E. Leiserson and his group, in particular by Matteo Frigo and Tao B. Schardl.
∎ Cilk is a multithreaded language for parallel programming that generalizes the semantics of C (resp. C++) by introducing linguistic constructs for parallel control.
∎ Cilk is a faithful extension of C (resp. C++). That is, the C (resp. C++) elision of a Cilk program is a correct implementation of the semantics of that program.
∎ Cilk’s scheduler maps strands onto processors dynamically at runtime, using the work-stealing principle. Under reasonable assumptions, this provides a guarantee of performance.
∎ Cilk has supporting tools for data race (thus non-deterministic behaviour) detection and performance analysis.

Heterogeneous programming with CUDA

∎ The parallel code is written for a thread
  - Each thread is free to execute a unique code path
  - Built-in thread and block ID variables are used to map each thread to a specific data tile (see next slide).
∎ Thus, each thread executes the same code.
∎ However, different threads work on different data, based on their thread and block IDs.

CUDA Example: increment array elements

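The code on the original slides was included as images; the following is a minimal sketch of the standard increment-array example (array size, block size and variable names are illustrative), consistent with the thread/block ID mapping described above.

// Minimal sketch (the slides' original code is not reproduced here): increment
// each element of an array on the GPU.  Each thread computes its global index
// from its block and thread IDs and updates the one element assigned to it.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void increment(float* a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)                                      // guard: last block may be partial
        a[idx] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* h = (float*)std::malloc(bytes);            // host (CPU) copy
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float* d = nullptr;                               // device (GPU) copy
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    increment<<<blocks, threadsPerBlock>>>(d, n);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // also synchronizes
    std::printf("h[0] = %.1f  h[n-1] = %.1f\n", h[0], h[n - 1]);

    cudaFree(d);
    std::free(h);
    return 0;
}
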