
Parallel Architectures

Parallel Algorithms
CUDA
Chris Rossbach
cs378h
Outline for Today

• Questions?
• Administrivia
• Eldar-* machines should be available
• Agenda
• Parallel Algorithms
• CUDA

• Acknowledgements:
http://developer.download.nvidia.com/compute/developertrainingmaterials/presentations/cuda_language/Introduction_to_CUDA_C.pptx

2
Faux Quiz Questions
• What is a reduction? A prefix sum? Why are they hard to parallelize and what basic techniques
can be used to parallelize them?
• Define flow dependence, output dependence, and anti-dependence: give an example of each.
Why/how do compilers use them to detect loop-independent vs loop-carried dependences?
• What is the difference between a thread-block and a warp?
• How/Why must programmers copy data back and forth to a GPU?
• What is “shared memory” in CUDA? Describe a setting in which it might be useful.
• CUDA kernels have implicit barrier synchronization. Why is __syncthreads() necessary in light of
this fact?
• How might one implement locks on a GPU?
• What ordering guarantees does a GPU provide across different hardware threads’ access to a
single memory location? To two disjoint locations?
• When is it safe for one GPU thread to wait (e.g. by spinning) for another?

3
Review: what is a vector processor?
Don't decode the same instruction
over and over…
Implementation:
• Instruction fetch/control logic shared
• Same instruction stream executed on:
  • Multiple pipelines
  • Multiple different operands in parallel

4
When does vector processing help?

What are the potential bottlenecks here?


When can it improve throughput?

Only helps if memory can keep the pipeline busy!


Hardware multi-threading
• Address memory bottleneck
• Share exec unit across
• Instruction streams
• Switch on stalls
• Looks like multiple cores to the OS
• Three variants:
• Coarse
• Fine-grain
• Simultaneous
Running example

Thread A Thread B Thread C Thread D

• Colors → pipeline full


• White → stall
Coarse-grained multithreading
• Single thread runs until a costly stall
• E.g. 2nd level cache miss
• Another thread starts during stall
• Pipeline fill time requires several cycles!
• Hardware support required
• PC and register file for each thread
• Looks like another physical CPU to
OS/software

Pros? Cons?
Fine-grained multithreading
• Threads interleave instructions
• Round-robin
• Skip stalled threads
• Hardware support required
• Separate PC and register file per thread
• Hardware to control alternating pattern
• Naturally hides delays
• Data hazards, Cache misses
• Pipeline runs with rare stalls

Pros? Cons?
Simultaneous Multithreading (SMT)
• Instructions from multiple threads
issued on same cycle
• Uses register renaming and the dynamic
scheduling facility of multi-issue architectures
• Hardware support:
• Register files, PCs per thread
• Temporary result registers pre-commit
• Support to sort out which threads get
results from which instructions

Pros? Cons?
Why Vector and Multithreading Background?
GPU:
• A very wide vector machine
• Massively multi-threaded to hide memory latency
• Originally designed for graphics pipelines…
Graphics ~= Rendering
Inputs
• 3D world model (objects, materials)
• Geometry modeled with triangle meshes, surface normals
• GPUs subdivide triangles into “fragments” (rasterization)
• Materials modeled with “textures”
• Texture coordinates, sampling “map” textures →
geometry
• Light locations and properties
• Attempt to model surface/light interactions with
modeled objects/materials
• View point

Output
• 2D projection seen from the view-point
12
Grossly over-simplified rendering algorithm
foreach(vertex v in model)
    map v_model → v_view
fragment[] frags = {};
foreach triangle t (v0, v1, v2)
    frags.add(rasterize(t));
foreach fragment f in frags
    choose_color(f);
display(visible_fragments(frags));

13
Algorithm → Graphics Pipeline
foreach(vertex v in model)
    map v_model → v_view
fragment[] frags = {};
foreach triangle t (v0, v1, v2)
    frags.add(rasterize(t));
foreach fragment f in frags
    choose_color(f);
display(visible_fragments(frags));
OpenGL pipeline
To first order, DirectX looks the same!

14
Graphics pipeline → GPU architecture

Limited “programmability” of shaders:


GeForce 6 series
• Minimal/no control flow
• Maximum instruction count

15
Late Modernity: unified shaders

Mapping to Graphics pipeline no longer apparent


Processing elements no longer specialized to a particular role
Model supports real control flow, larger instruction count

16
Mostly Modern: Pascal
Definitely Modern: Turing
Cross-generational GPU observations
GPUs designed for parallelism in the graphics pipeline:
• Data
  • Per-vertex
  • Per-fragment
  • Per-pixel
• Task
  • Vertex processing
  • Fragment processing
  • Rasterization
  • Hidden-surface elimination
• MLP
  • HW multi-threading for hiding memory latency

The resulting design:
• Simple cores
• Single instruction stream
• Vector instructions (SIMD) OR implicit HW-managed sharing (SIMT)
• Hide memory latency with HW multi-threading

Even as GPU architectures become more general, certain assumptions persist:
1. Data parallelism is trivially exposed
2. All problems look like painting a box with colored dots

But what if my problem isn't painting a box?!!?!

20
Programming Model
• GPUs are I/O devices, managed by user-code
• “kernels” == “shader programs”
• 1000s of HW-scheduled threads per kernel
• Threads grouped into independent blocks.
• Threads in a block can synchronize (barrier)
• This is the *only* synchronization
• “Grid” == “launch” == “invocation” of a kernel
• a group of blocks (or warps)
Need codes that are 1000s-X
parallel….
21
Parallel Algorithms
• Sequential algorithms often do not permit easy parallelization
• Does not mean the work has no parallelism
• A different approach can yield parallelism
• but often changes the algorithm
• Parallelizing != just adding locks to a sequential algorithm
• Parallel Patterns
• Map
• Scatter, Gather
• Reduction
• Scan
• Search, Sort

If you can express your algorithm using these patterns, an apparently fundamentally sequential algorithm can be made parallel
Map
• Inputs
• Array A
• Function f(x)
• map(A, f) → apply f(x) on all elements in A
• Parallelism trivially exposed
• f(x) can be applied in parallel to all elements, in principle

for(i=0; i<numPoints; i++) {
    labels[i] = findNearestCenter(points[i]);
}
// equivalent: map(points, findNearestCenter)

Why is this useful on a box-drawing machine?
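A minimal CUDA sketch of the map pattern (hypothetical names, not code from the slides): one thread applies f to one element.

__device__ int f(int x) { return x * x; }              // stand-in for any per-element function

__global__ void map_kernel(const int *in, int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;     // one thread per element
    if (i < n)                                         // guard the last partial block
        out[i] = f(in[i]);
}

// launch: map_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);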
Scatter and Gather

• Gather:
• Read multiple items to single /packed location
• Scatter:
• Write single/packed data item to multiple locations
• Inputs: x, y, indices, N

for (i=0; i<N; ++i)
    x[i] = y[idx[i]];      // gather(x, y, idx)

for (i=0; i<N; ++i)
    y[idx[i]] = x[i];      // scatter(x, y, idx)
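The same two loops as hedged CUDA sketches, one thread per iteration (x, y, idx are assumed to be device arrays; names are illustrative):

__global__ void gather_kernel(int *x, const int *y, const int *idx, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) x[i] = y[idx[i]];              // read from an indirect location
}

__global__ void scatter_kernel(int *y, const int *x, const int *idx, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) y[idx[i]] = x[i];              // write to an indirect location
                                              // (assumes idx has no duplicates, else a data race)
}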
24
Reduce
• Input
• Associative operator op
• Ordered set s = [a, b, c, … z]
• Reduce(op, s) returns a op b op c … op z

for(i=0; i<N; ++i) {
    accum += (point[i] * point[i]);    // accum = reduce(*, point)
}

Why must op be associative?
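One standard way to parallelize a reduction is a tree-shaped combine in shared memory; associativity is exactly what lets us regroup the operations. A minimal hedged sketch (sum only, power-of-two block size assumed; each block emits a partial sum that is combined in a second pass or on the host):

__global__ void reduce_sum(const int *in, int *block_sums, int n) {
    extern __shared__ int sdata[];                 // blockDim.x ints, sized at launch
    int tid = threadIdx.x;
    int i   = threadIdx.x + blockIdx.x * blockDim.x;
    sdata[tid] = (i < n) ? in[i] : 0;              // 0 is the identity for +
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // halve the active threads each step
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        block_sums[blockIdx.x] = sdata[0];         // one partial result per block
}

// launch: reduce_sum<<<numBlocks, 256, 256 * sizeof(int)>>>(d_in, d_partials, n);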


Scan (prefix sum)
• Input
• Associative operator op
• Ordered set s = [a, b, c, … z]
• Identity I

• scan(op, s) = [I, a, (a op b), (a op b op c) …]

• Scan is the workhorse of parallel algorithms:


• Sort, histograms, sparse matrix, string compare, …
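A common single-block approach is the Hillis–Steele scan: log2(blockDim.x) rounds, where each thread adds in the element a growing offset behind it. A hedged sketch (inclusive scan, one block, n <= blockDim.x, double-buffered shared memory sized 2 * blockDim.x at launch):

__global__ void scan_inclusive(const int *in, int *out, int n) {
    extern __shared__ int temp[];                  // 2 * blockDim.x ints
    int tid = threadIdx.x;
    int pout = 0, pin = 1;
    temp[tid] = (tid < n) ? in[tid] : 0;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        pout = 1 - pout; pin = 1 - pin;            // ping-pong buffers to avoid races
        if (tid >= offset)
            temp[pout * blockDim.x + tid] =
                temp[pin * blockDim.x + tid] + temp[pin * blockDim.x + tid - offset];
        else
            temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid];
        __syncthreads();
    }
    if (tid < n)
        out[tid] = temp[pout * blockDim.x + tid];
}

// launch: scan_inclusive<<<1, 256, 2 * 256 * sizeof(int)>>>(d_in, d_out, n);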
GroupBy
• Group a collection by key
• Lambda function maps elements → key
var res = ints.GroupBy(x => x);

Input:    10 30 20 10 20 30 10
Grouped:  10 10 10 30 30 20 20

foreach(T elem in ints) {
    key = KeyLambda(elem);
    group = GetGroup(key);
    group.Add(elem);
}

29
GroupBy using parallel primitives
Input: 10 30 20 10 20 30 10

1. Assign group IDs (sorting or hashing; hash-table lookup gives the group ID)
       Key:       10 20 30
       Group ID:   0  1  2

2. Compute group sizes (uses atomic increment)
       Key:        10 20 30
       Group ID:    0  1  2
       Group Size:  3  2  2

3. Compute start indices (prefix sum of group sizes)
       Key:               10 20 30
       Group ID:           0  1  2
       Group Start Index:  0  3  5

4. Write outputs to their locations (atomic increment; scatter/gather)
       Output: 10 10 10 20 20 30 30
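A hedged CUDA sketch of steps 2 and 4 above, assuming group IDs are already assigned (groupId[i] per element), groupStart holds the prefix sum from step 3, and groupSize/cursor are zero-initialized device arrays (all names illustrative):

__global__ void count_group_sizes(const int *groupId, int *groupSize, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        atomicAdd(&groupSize[groupId[i]], 1);                // step 2: atomic increment
}

__global__ void write_groups(const int *in, const int *groupId,
                             const int *groupStart, int *cursor, int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        int g    = groupId[i];
        int slot = groupStart[g] + atomicAdd(&cursor[g], 1); // step 4: claim an output slot
        out[slot] = in[i];                                   // scatter into the group's range
    }
}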

30
Sort
• OK, let’s build a parallel sort
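As one hedged illustration (not necessarily the construction developed in lecture), an odd-even transposition sort handles up to blockDim.x keys inside a single block, using __syncthreads() between phases:

__global__ void oddeven_sort(int *data, int n) {              // assumes n <= blockDim.x
    extern __shared__ int s[];
    int tid = threadIdx.x;
    if (tid < n) s[tid] = data[tid];
    __syncthreads();

    for (int phase = 0; phase < n; phase++) {
        int i = 2 * tid + (phase & 1);                        // even phases: (0,1),(2,3)…
        if (i + 1 < n && s[i] > s[i + 1]) {                   // odd phases:  (1,2),(3,4)…
            int t = s[i]; s[i] = s[i + 1]; s[i + 1] = t;
        }
        __syncthreads();                                      // finish the phase before the next
    }
    if (tid < n) data[tid] = s[tid];
}

// launch: oddeven_sort<<<1, n, n * sizeof(int)>>>(d_data, n);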

31
Summary
Re-expressing apparently sequential algorithms as combinations of
parallel patterns is a common technique when targeting GPUs

• Reductions
• Scans
• Re-orderings (scatter/gather)
• Sort
• Map

32
What is CUDA?
• CUDA Architecture
• Expose GPU parallelism for general-purpose computing
• Retain performance

• CUDA C/C++
• Based on industry-standard C/C++
• Small set of extensions to enable heterogeneous programming
• Straightforward APIs to manage devices, memory etc.

33
CONCEPTS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices

34
HELLO WORLD!
Heterogeneous Computing
▪ Terminology:
▪ Host The CPU and its memory (host memory)
▪ Device The GPU and its memory (device memory)

Host Device
36
Heterogeneous Computing
#include <iostream>
#include <algorithm>

using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

__global__ void stencil_1d(int *in, int *out) {


__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;

// Read input elements into shared memory


temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}

// Synchronize (ensure all the data is available)
__syncthreads();

// Apply the stencil


int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];

// Store the result


out[gindex] = result;
}

void fill_ints(int *x, int n) {


fill_n(x, n, 1);
}

int main(void) {
int *in, *out; // host copies of a, b, c
int *d_in, *d_out; // device copies of a, b, c
int size = (N + 2*RADIUS) * sizeof(int);

// Alloc space for host copies and setup values


in = (int *)malloc(size); fill_ints(in, N + 2*RADIUS);
out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

// Alloc space for device copies


cudaMalloc((void **)&d_in, size);
cudaMalloc((void **)&d_out, size);

// Copy to device
cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

// Launch stencil_1d() kernel on GPU


stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS,
d_out + RADIUS);

// Copy result back to host
cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

// Cleanup
free(in); free(out);
cudaFree(d_in); cudaFree(d_out);
return 0;
}

37
Simple Processing Flow

PCI Bus

1. Copy input data from CPU memory to GPU memory

38
Simple Processing Flow

PCI Bus

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute,
caching data on chip for performance

39
Simple Processing Flow

PCI Bus

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute,
caching data on chip for performance
3. Copy results from GPU memory to
CPU memory

40
Hello World!
int main(void) {
    printf("Hello World!\n");
    return 0;
}

• Standard C that runs on the host
• NVIDIA compiler (nvcc) can be used to compile programs with no device code

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$

41
Hello World! with Device Code
__global__ void mykernel(void) {
}

int main(void) {
mykernel<<<1,1>>>();
printf("Hello World!\n");
return 0;
}

▪ Two new syntactic elements…

42
Hello World! with Device Code
__global__ void mykernel(void) {
}

• CUDA C/C++ keyword __global__ indicates a function that:


• Runs on the device
• Is called from host code

• nvcc separates source code into host and device components


• Device functions (e.g. mykernel()) processed by NVIDIA compiler
• Host functions (e.g. main()) processed by standard host compiler
• gcc, cl.exe

43
Hello World! with Device Code
mykernel<<<1,1>>>();

• Triple angle brackets mark a call from host code to device code
• Also called a “kernel launch”
• We’ll return to the parameters (1,1) in a moment

• That’s all that is required to execute a function on the GPU!

44
Hello World! with Device Code
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$

• mykernel() does nothing, somewhat anticlimactic!

45
Parallel Programming in CUDA C/C++
• But wait… GPU computing is about
massive parallelism!

• We need a more interesting example…

• We’ll start by adding two integers and build up to vector addition


46
Addition on the Device
• A simple kernel to add two integers

__global__ void add(int *a, int *b, int *c) {


*c = *a + *b;
}

• As before __global__ is a CUDA C/C++ keyword meaning


• add() will execute on the device
• add() will be called from the host

47
Addition on the Device
• Note that we use pointers for the variables

__global__ void add(int *a, int *b, int *c) {


*c = *a + *b;
}

• add() runs on the device, so a, b and c must point to device memory

• We need to allocate memory on the GPU

48
Memory Management
• Host and device memory are separate entities
• Device pointers point to GPU memory
May be passed to/from host code
May not be dereferenced in host code
• Host pointers point to CPU memory
May be passed to/from device code
May not be dereferenced in device code

• Simple CUDA API for handling device memory


• cudaMalloc(), cudaFree(), cudaMemcpy()
• Similar to the C equivalents malloc(), free(), memcpy()

49
Addition on the Device: add()
• Returning to our add() kernel

__global__ void add(int *a, int *b, int *c) {


*c = *a + *b;
}

• Let’s take a look at main()…

50
Addition on the Device: main()
int main(void) {
int a, b, c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = sizeof(int);

// Allocate space for device copies of a, b, c


cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Setup input values


a = 2;
b = 7;

51
Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU


add<<<1,1>>>(d_a, d_b, d_c);

// Copy result back to host


cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}

52
RUNNING IN PARALLEL

53
Moving to Parallel
• GPU computing is about massive parallelism
• So how do we run code in parallel on the device?

add<<< 1, 1 >>>();

add<<< N, 1 >>>();

• Instead of executing add() once, execute N times in parallel

54
Vector Addition on the Device
• With add() running in parallel we can do vector addition

• Terminology: each parallel invocation of add() is a block


• The set of blocks is referred to as a grid
• Each invocation can refer to its block index using blockIdx.x

__global__ void add(int *a, int *b, int *c) {


c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index into the array, each block handles a different index

55
Vector Addition on the Device
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• On the device, each block can execute in parallel:

Block 0 Block 1 Block 2 Block 3


c[0] = a[0] + b[0]; c[1] = a[1] + b[1]; c[2] = a[2] + b[2]; c[3] = a[3] + b[3];

56
Vector Addition on the Device: add()
• Returning to our parallelized add() kernel

__global__ void add(int *a, int *b, int *c) {


c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• Let’s take a look at main()…

57
Vector Addition on the Device: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c


cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values


a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

58
Vector Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N blocks


add<<<N,1>>>(d_a, d_b, d_c);

// Copy result back to host


cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}

59
Review
• Difference between host and device
  • Host: CPU
  • Device: GPU
• __global__ declares device code
  • Executes on the device
  • Called from the host
• Passing parameters from host code to a device function
• Basic device memory management
  • cudaMalloc()
  • cudaMemcpy()
  • cudaFree()
• Launching parallel kernels
  • Launch N copies of add() with add<<<N,1>>>(…);
  • Use blockIdx.x to access block index

60
INTRODUCING THREADS

61
CUDA Threads
• Terminology: a block can be split into parallel threads

• Change add()to use parallel threads instead of parallel blocks:


__global__ void add(int *a, int *b, int *c) {
c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
• Use threadIdx.x instead of blockIdx.x

• Need to make one change in main()…


62
Vector Addition Using Threads: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c


cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values


a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
63
Vector Addition Using Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads


add<<<1,N>>>(d_a, d_b, d_c);

// Copy result back to host


cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
64
COMBINING THREADS AND BLOCKS

65
Combining Blocks and Threads
• We’ve seen parallel vector addition using:
• Many blocks with one thread each
• One block with many threads

• Let’s adapt vector addition to use both blocks and threads

• Why? We’ll come to that…

• First let’s discuss data indexing…


66
Indexing Arrays with Blocks and Threads
• No longer as simple as using blockIdx.x and threadIdx.x
• Index an array with one elem. per thread (8 threads/block)

threadIdx.x threadIdx.x threadIdx.x threadIdx.x

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

blockIdx.x = 0 blockIdx.x = 1 blockIdx.x = 2 blockIdx.x = 3

• With M threads/block, unique index per thread is :


int index = threadIdx.x + blockIdx.x * M;

67
Indexing Arrays: Example
• Which thread will operate on the red element?

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

M = 8 threadIdx.x = 5

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

blockIdx.x = 2

int index = threadIdx.x + blockIdx.x * M;


= 5 + 2 * 8;
= 21;
68
Vector Addition with Blocks and Threads
• Use the built-in variable blockDim.x for threads per block
int index = threadIdx.x + blockIdx.x * blockDim.x;

• Combined add() using parallel threads and blocks


__global__ void add(int *a, int *b, int *c) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
c[index] = a[index] + b[index];
}

• What changes need to be made in main()?

69
Addition with Blocks and Threads:
main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c


cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values


a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

70
Addition with Blocks and Threads:
main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU


add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

// Copy result back to host


cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}

71
Handling Arbitrary Vector Sizes
• Typical problems are not friendly multiples of
blockDim.x

• Avoid accessing beyond the end of the arrays:


__global__ void add(int *a, int *b, int *c, int n) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] + b[index];
}

• Update the kernel launch:


add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);

72
Why Bother with Threads?
• Threads seem unnecessary
• They add a level of complexity
• What do we gain?

• Unlike parallel blocks, threads have mechanisms to:


• Communicate
• Synchronize

• To look closer, we need a new example…

73
COOPERATING THREADS

75
Stencils
• Each pixel → function of neighbors
• Edge detection:

• Blur:

76
1D Stencil
• Consider 1D stencil over 1D array of elements
• Each output element is the sum of input elements within a radius

• Radius == 3 → each output element is sum of 7 input elements:

radius radius

77
Implementation within a block
• Each thread: process 1 output element
• blockDim.x elements per block
• Input elements read many times
  • With radius 3, each input element is read seven times

__global__ void stencil_1d(int *in, int *out) {
    // note: idx computation & edge conditions omitted…
    int result = 0;
    for (int offset = -R; offset <= R; offset++)
        result += in[idx + offset];

    // Store the result
    out[idx] = result;
}

78
Implementation within a block
• Each thread: process 1 output element
• blockDim.x elements per block
• Input elements read many times
  • With radius 3, each input element is read seven times

__global__ void stencil_1d(int *in, int *out) {
    // note: idx computation & edge conditions omitted…
    int result = 0;
    for (int offset = -R; offset <= R; offset++)
        result += in[idx + offset];

    // Store the result
    out[idx] = result;
}

Why is this a problem?

79
Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory

• Extremely fast on-chip memory, user-managed

• Declare using __shared__, allocated per block

• Data is not visible to threads in other blocks

80
Stencil with Shared Memory
• Cache data in shared memory
– Read (blockDim.x + 2 * radius) elements from memory to shared
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory

– Each block needs a halo of radius elements at each boundary

(halo on left | blockDim.x output elements | halo on right)

81


Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;

// Read input elements into shared memory


temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] =
in[gindex + BLOCK_SIZE];
}
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];

// Store the result
out[gindex] = result;
}

Are we done?

82
Data Race!
▪ The stencil example will not work…

▪ Suppose thread 15 reads the halo before thread 0 has fetched it…

temp[lindex] = in[gindex];                            // Store at temp[18]
if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];      // Skipped: threadIdx.x > RADIUS
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}

int result = 0;
result += temp[lindex + 1];                           // Load from temp[19]

83
__syncthreads()
• void __syncthreads();

• Synchronizes all threads within a block


– Used to prevent RAW / WAR / WAW hazards

• All threads must reach the barrier


– In conditional code, the condition must be uniform across the block

84
Correct Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;

// Read input elements into shared memory


temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] =
in[gindex + BLOCK_SIZE];
}
__syncthreads();
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
    result += temp[lindex + offset];

// Store the result
out[gindex] = result;
}

85
Notes on __syncthreads()
• void __syncthreads();
• Synchronizes all threads within a block
  – Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
  – In conditional code, the condition must be uniform across the block

__global__ void some_kernel(int *in, int *out) {
    // good idea?
    if (threadIdx.x == SOME_VALUE)
        __syncthreads();
}

__device__ void lock_trick(int *in, int *out) {
    __syncthreads();
    if (myIndex == 0)
        critical_section();
    __syncthreads();
}

86
Atomics
• Race conditions: traditional locks are to be avoided
• How do we synchronize?
• Read-Modify-Write made uninterruptible: atomics

atomicAdd()   atomicInc()
atomicSub()   atomicDec()
atomicMin()   atomicExch()
atomicMax()   atomicCAS()
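Echoing the earlier quiz question about implementing locks on a GPU, a hedged sketch of a spinlock built from atomicCAS()/atomicExch(); it is only safe when contention is restricted (e.g. one thread per block), since threads in the same warp cannot spin on each other on older architectures:

__device__ void acquire_lock(int *lock) {
    while (atomicCAS(lock, 0, 1) != 0)       // spin until we flip 0 -> 1
        ;
    __threadfence();                         // see writes made before the last release
}

__device__ void release_lock(int *lock) {
    __threadfence();                         // publish the critical section's writes
    atomicExch(lock, 0);
}

__global__ void increment_with_lock(int *lock, int *counter) {
    if (threadIdx.x == 0) {                  // one contender per block
        acquire_lock(lock);
        *counter = *counter + 1;             // critical section
        release_lock(lock);
    }
}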

87
Recap
• Launching parallel threads
• Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
• Use blockIdx.x to access block index within grid
• Use threadIdx.x to access thread index within block
• Allocate elements to threads:

int index = threadIdx.x + blockIdx.x * blockDim.x;

• Use __shared__ to declare a variable/array in shared memory
  • Data is shared between threads in a block
  • Not visible to threads in other blocks

• Use __syncthreads() as a barrier
  • Use to prevent data hazards

88
MANAGING THE DEVICE

89
Coordinating Host & Device
• Kernel launches are asynchronous
• Control returns to the CPU immediately

• CPU needs to synchronize before consuming the results

cudaMemcpy()              Blocks the CPU until the copy is complete;
                          the copy begins when all preceding CUDA calls have completed
cudaMemcpyAsync()         Asynchronous, does not block the CPU
cudaDeviceSynchronize()   Blocks the CPU until all preceding CUDA calls have completed
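A hedged sketch of the usual pattern (my_kernel and the buffer names are illustrative):

__global__ void my_kernel(const int *in, int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];                            // placeholder work
}

void run_and_wait(int *d_in, int *d_out, int *h_out, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    my_kernel<<<blocks, threads>>>(d_in, d_out, n);       // returns to the CPU immediately

    // cudaMemcpy waits for preceding CUDA calls, then blocks until the copy completes
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // explicit barrier, useful before timing or error checks
    cudaDeviceSynchronize();
}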
90
Reporting Errors
• All CUDA API calls return an error code (cudaError_t)
• Error in the API call itself
OR
• Error in an earlier asynchronous operation (e.g. kernel)

• Get the error code for the last error:


cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
const char *cudaGetErrorString(cudaError_t)

printf("%s\n", cudaGetErrorString(cudaGetLastError()));
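A common convenience (a hedged sketch, not part of the CUDA API itself) is a checking macro wrapped around every call:

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// usage:
//   CUDA_CHECK(cudaMalloc((void **)&d_a, size));
//   mykernel<<<1,1>>>();
//   CUDA_CHECK(cudaGetLastError());    // catches launch/configuration errors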

91
Device Management
• Application can query and select GPUs
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)

• Multiple threads can share a device

• A single thread can manage multiple devices


cudaSetDevice(i) to select current device
cudaMemcpy(…) for peer-to-peer copies✝

✝ requires OS and device support
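A hedged sketch of enumerating devices with these calls and printing a couple of properties:

#include <cstdio>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %d SMs, %zu bytes of global memory\n",
               i, prop.name, prop.multiProcessorCount, prop.totalGlobalMem);
    }
    cudaSetDevice(0);    // select the device used by subsequent CUDA calls
    return 0;
}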


92
Questions?

93
