Introduction to Parallel and Distributed Computing
Plan
3. Concurrency platforms
The CPU-Memory GAP
Hierarchical memory
From the upper (smaller, faster) levels to the lower (larger, cheaper) levels:
∎ CPU registers: 100s of bytes, 300–500 ps (0.3–0.5 ns) access time; data is staged as instruction operands of 1–8 bytes, managed by the program/compiler
∎ L1 and L2 caches: 10s–100s of KB, ~1 ns to ~10 ns, ~$1000s/GB; data is staged in cache blocks of 32–64 bytes, managed by the cache controller
∎ Main memory: GBs, 80–200 ns, ~$100/GB; data moves between cache and memory in blocks of 64–128 bytes
∎ Disk: 10s of TB, ~10 ms (10,000,000 ns), ~$1/GB; data moves between memory and disk in pages of 4–8 KB, managed by the OS
∎ Tape: effectively unlimited capacity, seconds to minutes access time, ~$1/GB; data moves between disk and tape as files of MBs, managed by the user/operator
Going down the hierarchy, each level is larger but slower and cheaper per byte than the one above it (a small locality experiment is sketched below).
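The practical consequence of the hierarchy is that programs run fastest when they reuse data that is already near the top of it. Below is a minimal C++ sketch of a locality experiment (illustrative, not from the slides; the matrix size is an arbitrary assumption): the same sum is computed by traversing a row-major matrix in row order (stride-1, cache-friendly) and in column order (large stride, cache-unfriendly).

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 4096;                 // 4096 x 4096 doubles (~128 MB)
    std::vector<double> a(n * n, 1.0);

    auto time_sum = [&](bool row_major) {
        auto start = std::chrono::steady_clock::now();
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                s += row_major ? a[i * n + j]   // stride-1: walks memory sequentially
                               : a[j * n + i];  // stride-n: jumps a full row per access
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s traversal: sum = %.0f, %lld ms\n",
                    row_major ? "row-major" : "column-major", s, (long long)ms);
    };

    time_sum(true);    // cache-friendly: typically several times faster
    time_sum(false);   // cache-unfriendly
}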
Moore’s law
The Pentium Family: Do not rewrite software, just buy a new machine!
https://en.wikipedia.org/wiki/Moore%27s_law
From Moore’s law to multicore processors
Multicore processors
∎ In the 1st Gen. Intel Core i7, each core had an L1 data cache and an
L1 instruction cache, together with a unified L2 cache
∎ The cores share an L3 cache
∎ Note the sizes of the successive caches
Graphics processing units (GPUs)
∎ In a GPU, the small local memories have much smaller access time
than the large shared memory.
∎ Thus, as much as possible, cores should access data in the local memories,
while the shared memory is essentially used for data exchange between SMs.
Distributed Memory
Hybrid Distributed-Shared Memory
∎ The largest and fastest computers in the world today employ both
shared and distributed memory architectures.
∎ Current trends seem to indicate that this type of memory architecture
will continue to prevail.
∎ While this model allows for applications to scale, it increases the
complexity of writing computer programs.
Divide-and-Conquer
Divide-and-Conquer and Fork-Join
∎ Recursively applying fork-join can “easily” parallelize a divide-and-conquer
algorithm, as sketched below
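As a minimal illustration of fork-join (an illustrative sketch, not the course's Cilk code; it uses std::async from the C++ standard library and an arbitrary serial cutoff): a divide-and-conquer sum where the left half is forked as a task, the right half is computed by the caller, and the results are combined after the join.

#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

long long parallel_sum(const long long* a, std::size_t n) {
    if (n <= (1u << 16))                             // small case: solve serially
        return std::accumulate(a, a + n, 0LL);
    std::size_t half = n / 2;
    auto left = std::async(std::launch::async,       // fork: left half runs as a new task
                           parallel_sum, a, half);
    long long right = parallel_sum(a + half, n - half);
    return left.get() + right;                       // join, then combine
}

int main() {
    std::vector<long long> v(1 << 20, 1);
    std::printf("%lld\n", parallel_sum(v.data(), v.size()));   // prints 1048576
}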
Map
∎ Simultaneously execute a function on each data item in a collection
∎ If more data items than threads, apply the pattern block-wise:
(1) partition the collection, and (2) apply one thread to each part
∎ This pattern is often simplified as just a parallel_for loop
∎ Where multiple map steps are performed in a row,
they may operate in lockstep
[Figure: each input data item is processed by an independent function execution, producing one output item per input.]
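A minimal C++ sketch of the block-wise map described above (illustrative; parallel_map, the chunk size and the thread count are assumptions, not part of the slides): the collection is partitioned into one contiguous part per thread, and each thread applies the function to every element of its part.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

template <class T, class F>
void parallel_map(std::vector<T>& data, F f, unsigned nthreads) {
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(data.size(), lo + chunk);
        if (lo >= hi) break;
        workers.emplace_back([&data, f, lo, hi] {    // one thread per part
            for (std::size_t i = lo; i < hi; ++i)
                data[i] = f(data[i]);                // independent element-wise update
        });
    }
    for (auto& w : workers) w.join();
}

int main() {
    std::vector<double> v(1000000, 2.0);
    parallel_map(v, [](double x) { return std::sqrt(x); }, 4);
    std::printf("%f\n", v[0]);                       // 1.414214
}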
Workpile
∎ Workpile generalizes the map pattern to a queue of tasks
∎ Tasks in flight can add new tasks to the input queue
∎ Threads take tasks from the queue until it is empty
∎ Can be seen as a parallel_while loop (a minimal sketch follows below)
[Figure: tasks flow from an input queue through function executions to the output; executing tasks may add new tasks to the input queue.]
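A minimal C++ sketch of the workpile (illustrative; the Workpile class and its task type are assumptions, not from the slides): worker threads repeatedly pop tasks from a shared queue, a running task may push new tasks, and the pool stops once the queue is empty and no task is still in flight.

#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class Workpile {
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    int in_flight = 0;                                   // tasks currently executing
public:
    void push(std::function<void()> t) {
        { std::lock_guard<std::mutex> lk(m); tasks.push(std::move(t)); }
        cv.notify_one();
    }
    void run(int nthreads) {                              // drain the pile with nthreads workers
        auto worker = [this] {
            std::unique_lock<std::mutex> lk(m);
            for (;;) {
                cv.wait(lk, [this] { return !tasks.empty() || in_flight == 0; });
                if (tasks.empty()) { cv.notify_all(); return; }   // nothing left to do
                auto t = std::move(tasks.front()); tasks.pop(); ++in_flight;
                lk.unlock();
                t();                                      // may call push() and add new work
                lk.lock();
                --in_flight;
                if (tasks.empty() && in_flight == 0) cv.notify_all();
            }
        };
        std::vector<std::thread> pool;
        for (int i = 0; i < nthreads; ++i) pool.emplace_back(worker);
        for (auto& th : pool) th.join();
    }
};

int main() {
    Workpile wp;
    // Illustrative tasks: each task for n > 0 pushes a follow-up task for n - 1,
    // so new work is discovered while the pile is being processed.
    std::function<void(int)> job = [&](int n) {
        std::printf("processing task %d\n", n);
        if (n > 0) wp.push([&, n] { job(n - 1); });
    };
    wp.push([&] { job(3); });
    wp.run(4);                                            // 4 worker threads drain the pile
}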
Reduction
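The reduction pattern combines all items of a collection into a single result using an associative operation such as a sum or a maximum. A minimal C++17 sketch (illustrative, not from the slides; depending on the toolchain, the parallel execution policy may need an extra library such as TBB):

#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> v(1000000, 0.5);
    // std::reduce may combine the elements in any order and grouping,
    // which is why the operation must be associative (here: addition).
    double total = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
    std::printf("total = %f\n", total);      // 500000.000000
}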
Producer-Consumer
∎ The consumer processes data items, pulling them from the queue
[Figure: a producer thread pushes data items onto a data queue; a consumer thread pulls them off and processes them. A minimal sketch follows below.]
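A minimal C++ sketch of a producer-consumer pair over a bounded queue (illustrative; the queue capacity and the -1 sentinel are assumptions, not from the slides): the producer blocks while the queue is full, the consumer blocks while it is empty, and a sentinel value tells the consumer to stop.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> q;
std::mutex m;
std::condition_variable not_empty, not_full;
const std::size_t capacity = 8;

void produce(int n) {
    for (int i = 0; i <= n; ++i) {
        int item = (i < n) ? i : -1;                  // -1 is the "done" sentinel
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [] { return q.size() < capacity; });
        q.push(item);
        not_empty.notify_one();
    }
}

void consume() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        not_empty.wait(lk, [] { return !q.empty(); });
        int item = q.front(); q.pop();
        not_full.notify_one();
        lk.unlock();
        if (item == -1) return;                       // sentinel: no more items
        std::printf("consumed %d\n", item);           // "process" the item
    }
}

int main() {
    std::thread p(produce, 100), c(consume);
    p.join(); c.join();
}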
Pipeline
∎ A sequence of stages where the output of one stage is used as the
input to another
∎ Example: in a pipelined processor, instructions flow through the
central processing unit (CPU) in stages (Instruction Fetch, Decode,
Execute, etc.)
∎ Two consecutive stages form a producer-consumer pair
∎ Internal stages are both producer and consumer
∎ Typically, a pipeline is constructed statically through code
organization
∎ Pipelines can be created dynamically and implicitly with
AsyncGenerators and the call-stack
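A minimal C++ sketch of a three-stage pipeline (illustrative; the Channel type and the -1 sentinel are assumptions, not from the slides): stage 1 generates numbers, stage 2 squares them, stage 3 prints them. Each adjacent pair of stages is a producer-consumer pair connected by a small thread-safe queue, and the sentinel is forwarded downstream to shut the stages down in order.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

struct Channel {                                   // minimal thread-safe FIFO
    std::queue<long> q;
    std::mutex m;
    std::condition_variable cv;
    void put(long x) {
        { std::lock_guard<std::mutex> lk(m); q.push(x); }
        cv.notify_one();
    }
    long get() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        long x = q.front(); q.pop();
        return x;
    }
};

int main() {
    Channel a, b;
    std::thread generate([&] {                     // stage 1: produce 1..10
        for (long i = 1; i <= 10; ++i) a.put(i);
        a.put(-1);                                 // end-of-stream sentinel
    });
    std::thread square([&] {                       // stage 2: transform
        for (long x; (x = a.get()) != -1; ) b.put(x * x);
        b.put(-1);                                 // forward the sentinel
    });
    std::thread print([&] {                        // stage 3: consume
        for (long x; (x = b.get()) != -1; ) std::printf("%ld\n", x);
    });
    generate.join(); square.join(); print.join();
}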
Pascal triangle construction: a stencil computation
[Figure: the entries of Pascal's triangle arranged on a square grid (rows 1 1 1 1 ..., 1 2 3 4 ..., 1 3 6 10 ..., 1 4 10 20 ..., ...); every inner entry is the sum of its north and west neighbours, which makes the construction a stencil computation. A small serial sketch follows below.]
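A small serial C++ sketch of the stencil (illustrative; the 8 x 8 grid size is an assumption): the first row and first column are set to 1 and every other entry is the sum of its north and west neighbours, reproducing the rows of the figure.

#include <cstdio>
#include <vector>

int main() {
    const int n = 8;
    std::vector<std::vector<long long>> T(n, std::vector<long long>(n, 1));
    for (int i = 1; i < n; ++i)
        for (int j = 1; j < n; ++j)
            T[i][j] = T[i - 1][j] + T[i][j - 1];   // north + west stencil
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) std::printf("%4lld ", T[i][j]);
        std::printf("\n");                         // rows 1 1 1 ..., 1 2 3 ..., 1 3 6 ...
    }
}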
Divide and conquer: principle
[Figure: divide-and-conquer on the same grid: the region is split into four quadrants; the north-west quadrant (I) is computed first, the north-east and south-west quadrants (II) can then be computed in parallel, and the south-east quadrant (III) is computed last; each quadrant is handled recursively in the same way.]
Blocking strategy: principle
[Figure: the grid (rows a0, ..., a7) is partitioned into square blocks; each block depends only on the blocks above it and to its left, so the blocks on the same anti-diagonal (labelled 1, 2, 3, 4) are independent and are computed in parallel, wave after wave. A sketch follows below.]
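A C++ sketch of the blocking strategy (illustrative; the grid size, block size and the modulus used to avoid integer overflow are assumptions, not from the slides): the grid is cut into square blocks, and the blocks on each anti-diagonal are processed in parallel, one wave after another, since a block depends only on blocks above it and to its left.

#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

const int n = 512, B = 128, nb = n / B;              // grid size, block size, blocks per side
std::vector<std::vector<long long>> T(n, std::vector<long long>(n, 1));

void do_block(int bi, int bj) {                      // fill one B x B block
    for (int i = std::max(1, bi * B); i < (bi + 1) * B; ++i)
        for (int j = std::max(1, bj * B); j < (bj + 1) * B; ++j)
            T[i][j] = (T[i - 1][j] + T[i][j - 1]) % 1000000007;  // north + west, kept modulo a prime
}

int main() {
    for (int wave = 0; wave <= 2 * (nb - 1); ++wave) {       // anti-diagonals of blocks
        std::vector<std::thread> workers;
        for (int bi = 0; bi < nb; ++bi) {
            int bj = wave - bi;
            if (bj >= 0 && bj < nb)
                workers.emplace_back(do_block, bi, bj);      // blocks of a wave run in parallel
        }
        for (auto& w : workers) w.join();                    // wait for the wave to finish
    }
    std::printf("T[%d][%d] mod p = %lld\n", n - 1, n - 1, T[n - 1][n - 1]);
}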
Outline
3. Concurrency platforms
Programming patterns in Julia
[Julia REPL banner: Version 1.7.1 (2021-12-22); documentation at https://docs.julialang.org]
Julia
using Distributed   # provides nprocs, myid, remotecall_fetch in Julia 1.x

function pmap(f, lst)
    np = nprocs()                # the number of processes available
    n = length(lst)
    results = Vector{Any}(undef, n)
    i = 1
    # function to produce the next work item from the queue
    nextidx() = (idx = i; i += 1; idx)
    @sync begin
        for p = 1:np
            if p != myid() || np == 1
                @async begin
                    while true
                        idx = nextidx()
                        if idx > n
                            break
                        end
                        results[idx] = remotecall_fetch(f, p, lst[idx])
                    end
                end
            end
        end
    end
    results
end
Fork-Join with Cilk
#include <cilk/cilk.h>

int fib(int n)
{
  if (n < 2) return n;
  int x, y;
  x = cilk_spawn fib(n-1);   // the spawned call may run in parallel
  y = fib(n-2);              // with this continuation
  cilk_sync;                 // wait for the spawned call before using x
  return x+y;
}
Cilk
∎ Cilk has been developed since 1994 at the MIT Laboratory for
Computer Science by Prof. Charles E. Leiserson and his group, in
particular by Matteo Frigo and Tao B. Schardl
∎ Cilk is a multithreaded language for parallel programming that
generalizes the semantics of C (resp. C++) by introducing linguistic
constructs for parallel control.
∎ Cilk is a faithful extension of C (resp. C++). That is, the C (resp.
C++) elision of a Cilk program is a correct implementation of the
semantics of that program.
∎ Cilk’s scheduler maps strands onto processors dynamically at runtime, using
the work-stealing principle. Under reasonable assumptions, this provides a
performance guarantee.
∎ Cilk has supporting tools for detecting data races (and thus
non-deterministic behaviour) and for performance analysis.
Heterogeneous programming with CUDA
∎ The parallel code is written for a thread
↪ Each thread is free to execute a unique code path
↪ Built-in thread and block ID variables are used to map each thread
to a specific data tile (see next slide).
∎ Thus, each thread executes the same code.
∎ However, different threads work on different data, based on their
thread and block IDs.
CUDA Example: increment array elements (1/2)
CUDA Example: increment array elements (2/2)
References
[1] M. McCool, J. Reinders, and A. Robison. Structured Parallel Programming:
Patterns for Efficient Computation. Elsevier, 2012.
[2] J. E. Savage. Models of Computation: Exploring the Power of Computing.
Addison-Wesley, 1998. ISBN: 978-0-201-89539-1.
[3] M. L. Scott. Programming Language Pragmatics (3rd ed.). Academic Press,
2009. ISBN: 978-0-12-374514-9.
[4] A. Williams. C++ Concurrency in Action: Practical Multithreading (1st ed.).
Shelter Island, NY: Manning Publ., 2012. URL: https://cds.cern.ch/record/1483005.