
Parallel Architectures

Parallel Algorithms
CUDA
Chris Rossbach
cs378h
Outline for Today

• Questions?
• Administrivia
• Eldar-* machines should be available
• Agenda
• Parallel Algorithms
• CUDA

• Acknowledgements:
http://developer.download.nvidia.com/compute/developertrainingmaterials/presentations/cuda_language/Introduction_to_CUDA_C.pptx

2
Faux Quiz Questions
• What is a reduction? A prefix sum? Why are they hard to parallelize and what basic techniques
can be used to parallelize them?
• Define flow dependence, output dependence, and anti-dependence: give an example of each.
Why/how do compilers use them to detect loop-independent vs loop-carried dependences?
• What is the difference between a thread-block and a warp?
• How/Why must programmers copy data back and forth to a GPU?
• What is “shared memory” in CUDA? Describe a setting in which it might be useful.
• CUDA kernels have implicit barrier synchronization. Why is __syncthreads() necessary in light of
this fact?
• How might one implement locks on a GPU?
• What ordering guarantees does a GPU provide across different hardware threads’ access to a
single memory location? To two disjoint locations?
• When is it safe for one GPU thread to wait (e.g. by spinning) for another?

3
Review: what is a vector processor?
Don't decode the same instruction
over and over…
Implementation:
• Instruction fetch/control logic shared
• Same instruction stream executed on:
  • Multiple pipelines
  • Multiple different operands in parallel

4
When does vector processing help?

What are the potential bottlenecks here?


When can it improve throughput?

Only helps if memory can keep the pipeline busy!


Hardware multi-threading
• Address memory bottleneck
• Share exec unit across
• Instruction streams
• Switch on stalls
• Looks like multiple cores to the OS
• Three variants:
• Coarse
• Fine-grain
• Simultaneous
Running example

Thread A Thread B Thread C Thread D

• Colors → pipeline full


• White → stall
Coarse-grained multithreading
• Single thread runs until a costly stall
• E.g. 2nd level cache miss
• Another thread starts during stall
• Pipeline fill time requires several cycles!
• Hardware support required
• PC and register file for each thread
• Looks like another physical CPU to
OS/software

Pros? Cons?
Fine-grained multithreading
• Threads interleave instructions
• Round-robin
• Skip stalled threads
• Hardware support required
• Separate PC and register file per thread
• Hardware to control alternating pattern
• Naturally hides delays
• Data hazards, Cache misses
• Pipeline runs with rare stalls

Pros? Cons?
Simultaneous Multithreading (SMT)
• Instructions from multiple threads
issued on same cycle
• Uses register renaming and the dynamic
scheduling facility of multi-issue architectures
• Hardware support:
• Register files, PCs per thread
• Temporary result registers pre-commit
• Support to sort out which threads get
results from which instructions

Pros? Cons?
Why Vector and Multithreading Background?
GPU:
• A very wide vector machine
• Massively multi-threaded to hide memory latency
• Originally designed for graphics pipelines…
Graphics ~= Rendering
Inputs
• 3D world model (objects, materials)
• Geometry modeled with triangle meshes, surface normals
• GPUs subdivide triangles into “fragments” (rasterization)
• Materials modeled with “textures”
• Texture coordinates, sampling “map” textures →
geometry
• Light locations and properties
• Attempt to model surface/light interactions with
modeled objects/materials
• View point

Output
• 2D projection seen from the view-point
12
Grossly over-simplified rendering algorithm
foreach(vertex v in model)
    map v_model → v_view
fragment[] frags = {};
foreach triangle t (v0, v1, v2)
    frags.add(rasterize(t));
foreach fragment f in frags
    choose_color(f);
display(visible_fragments(frags));

13
Algorithm → Graphics Pipeline
foreach(vertex v in model)
    map v_model → v_view
fragment[] frags = {};
foreach triangle t (v0, v1, v2)
    frags.add(rasterize(t));
foreach fragment f in frags
    choose_color(f);
display(visible_fragments(frags));
OpenGL pipeline
To first order, DirectX looks the same!

14
Graphics pipeline → GPU architecture

Limited “programmability” of shaders:


GeForce 6 series
• Minimal/no control flow
• Maximum instruction count

15
Late Modernity: unified shaders

Mapping to Graphics pipeline no longer apparent


Processing elements no longer specialized to a particular role
Model supports real control flow, larger instruction count

16
Mostly Modern: Pascal
Definitely Modern: Turing
Cross-generational GPU observations
GPUs designed for parallelism in the graphics pipeline:
• Data
  • Per-vertex
  • Per-fragment
  • Per-pixel
• Task
  • Vertex processing
  • Fragment processing
  • Rasterization
  • Hidden-surface elimination
• MLP
  • HW multi-threading for hiding memory latency

The resulting design:
• Simple cores
• Single instruction stream
• Vector instructions (SIMD) OR implicit HW-managed sharing (SIMT)
• Hide memory latency with HW multi-threading

Even as GPU architectures become more general, certain assumptions persist:
1. Data parallelism is trivially exposed
2. All problems look like painting a box with colored dots

But what if my problem isn't painting a box?!!?!

20
Programming Model
• GPUs are I/O devices, managed by user-code
• “kernels” == “shader programs”
• 1000s of HW-scheduled threads per kernel
• Threads grouped into independent blocks.
• Threads in a block can synchronize (barrier)
• This is the *only* synchronization
• “Grid” == “launch” == “invocation” of a kernel
• a group of blocks (or warps)
Need codes that are 1000s-X
parallel….
21
Parallel Algorithms
• Sequential algorithms often do not permit easy parallelization
• Does not mean the work has no parallelism
• A different approach can yield parallelism
• but often changes the algorithm
• Parallelizing != just adding locks to a sequential algorithm
• Parallel Patterns
• Map
• Scatter, Gather
• Reduction
• Scan
• Search, Sort

If you can express your algorithm using these patterns, an apparently fundamentally sequential algorithm can be made parallel
Map
• Inputs
• Array A
• Function f(x)
• map(A, f) → apply f(x) on all elements in A
• Parallelism trivially exposed
• f(x) can be applied in parallel to all elements, in principle

for(i=0; i<numPoints; i++) {
    labels[i] = findNearestCenter(points[i]);
}
// equivalent: map(points, findNearestCenter)

Why is this useful on a box-drawing machine?
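A minimal CUDA sketch of the map pattern (hypothetical names, not code from the slides): one thread applies f to one element.

__device__ int f(int x) { return x * x; }              // stand-in for any per-element function

__global__ void map_kernel(const int *in, int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;     // one thread per element
    if (i < n)                                         // guard the last partial block
        out[i] = f(in[i]);
}

// launch: map_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);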
Scatter and Gather

• Gather:
• Read multiple items to single /packed location
• Scatter:
• Write single/packed data item to multiple locations
• Inputs: x, y, indices, N

for (i=0; i<N; ++i)
    x[i] = y[idx[i]];      // gather(x, y, idx)

for (i=0; i<N; ++i)
    y[idx[i]] = x[i];      // scatter(x, y, idx)
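The same two loops as hedged CUDA sketches, one thread per iteration (x, y, idx are assumed to be device arrays; names are illustrative):

__global__ void gather_kernel(int *x, const int *y, const int *idx, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) x[i] = y[idx[i]];              // read from an indirect location
}

__global__ void scatter_kernel(int *y, const int *x, const int *idx, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) y[idx[i]] = x[i];              // write to an indirect location
                                              // (assumes idx has no duplicates, else a data race)
}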
24
Reduce
• Input
• Associative operator op
• Ordered set s = [a, b, c, … z]
• Reduce(op, s) returns a op b op c … op z

for(i=0; i<N; ++i) {
    accum += (point[i] * point[i]);    // accum = reduce(*, point)
}

Why must op be associative?
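One standard way to parallelize a reduction is a tree-shaped combine in shared memory; associativity is exactly what lets us regroup the operations. A minimal hedged sketch (sum only, power-of-two block size assumed; each block emits a partial sum that is combined in a second pass or on the host):

__global__ void reduce_sum(const int *in, int *block_sums, int n) {
    extern __shared__ int sdata[];                 // blockDim.x ints, sized at launch
    int tid = threadIdx.x;
    int i   = threadIdx.x + blockIdx.x * blockDim.x;
    sdata[tid] = (i < n) ? in[i] : 0;              // 0 is the identity for +
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // halve the active threads each step
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        block_sums[blockIdx.x] = sdata[0];         // one partial result per block
}

// launch: reduce_sum<<<numBlocks, 256, 256 * sizeof(int)>>>(d_in, d_partials, n);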


Scan (prefix sum)
• Input
• Associative operator op
• Ordered set s = [a, b, c, … z]
• Identity I

• scan(op, s) = [I, a, (a op b), (a op b op c) …]

• Scan is the workhorse of parallel algorithms:


• Sort, histograms, sparse matrix, string compare, …
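A common single-block approach is the Hillis–Steele scan: log2(blockDim.x) rounds, where each thread adds in the element a growing offset behind it. A hedged sketch (inclusive scan, one block, n <= blockDim.x, double-buffered shared memory sized 2 * blockDim.x at launch):

__global__ void scan_inclusive(const int *in, int *out, int n) {
    extern __shared__ int temp[];                  // 2 * blockDim.x ints
    int tid = threadIdx.x;
    int pout = 0, pin = 1;
    temp[tid] = (tid < n) ? in[tid] : 0;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        pout = 1 - pout; pin = 1 - pin;            // ping-pong buffers to avoid races
        if (tid >= offset)
            temp[pout * blockDim.x + tid] =
                temp[pin * blockDim.x + tid] + temp[pin * blockDim.x + tid - offset];
        else
            temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid];
        __syncthreads();
    }
    if (tid < n)
        out[tid] = temp[pout * blockDim.x + tid];
}

// launch: scan_inclusive<<<1, 256, 2 * 256 * sizeof(int)>>>(d_in, d_out, n);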
GroupBy
• Group a collection by key
• Lambda function maps elements → key
var res = ints.GroupBy(x => x);

Input:    10 30 20 10 20 30 10
Grouped:  10 10 10 30 30 20 20

foreach(T elem in ints) {
    key = KeyLambda(elem);
    group = GetGroup(key);
    group.Add(elem);
}

29
GroupBy using parallel primitives
Input: 10 30 20 10 20 30 10

1. Assign group IDs (sorting or hashing; hash-table lookup gives the group ID)
       Key:       10 20 30
       Group ID:   0  1  2

2. Compute group sizes (uses atomic increment)
       Key:        10 20 30
       Group ID:    0  1  2
       Group Size:  3  2  2

3. Compute start indices (prefix sum of group sizes)
       Key:               10 20 30
       Group ID:           0  1  2
       Group Start Index:  0  3  5

4. Write outputs to their locations (atomic increment; scatter/gather)
       Output: 10 10 10 20 20 30 30
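A hedged CUDA sketch of steps 2 and 4 above, assuming group IDs are already assigned (groupId[i] per element), groupStart holds the prefix sum from step 3, and groupSize/cursor are zero-initialized device arrays (all names illustrative):

__global__ void count_group_sizes(const int *groupId, int *groupSize, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        atomicAdd(&groupSize[groupId[i]], 1);                // step 2: atomic increment
}

__global__ void write_groups(const int *in, const int *groupId,
                             const int *groupStart, int *cursor, int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        int g    = groupId[i];
        int slot = groupStart[g] + atomicAdd(&cursor[g], 1); // step 4: claim an output slot
        out[slot] = in[i];                                   // scatter into the group's range
    }
}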

30
Sort
• OK, let’s build a parallel sort
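As one hedged illustration (not necessarily the construction developed in lecture), an odd-even transposition sort handles up to blockDim.x keys inside a single block, using __syncthreads() between phases:

__global__ void oddeven_sort(int *data, int n) {              // assumes n <= blockDim.x
    extern __shared__ int s[];
    int tid = threadIdx.x;
    if (tid < n) s[tid] = data[tid];
    __syncthreads();

    for (int phase = 0; phase < n; phase++) {
        int i = 2 * tid + (phase & 1);                        // even phases: (0,1),(2,3)…
        if (i + 1 < n && s[i] > s[i + 1]) {                   // odd phases:  (1,2),(3,4)…
            int t = s[i]; s[i] = s[i + 1]; s[i + 1] = t;
        }
        __syncthreads();                                      // finish the phase before the next
    }
    if (tid < n) data[tid] = s[tid];
}

// launch: oddeven_sort<<<1, n, n * sizeof(int)>>>(d_data, n);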

31
Summary
Re-expressing apparently sequential algorithms as combinations of
parallel patterns is a common technique when targeting GPUs

• Reductions
• Scans
• Re-orderings (scatter/gather)
• Sort
• Map

32
What is CUDA?
• CUDA Architecture
• Expose GPU parallelism for general-purpose computing
• Retain performance

• CUDA C/C++
• Based on industry-standard C/C++
• Small set of extensions to enable heterogeneous programming
• Straightforward APIs to manage devices, memory etc.

33
CONCEPTS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices

34
HELLO WORLD!
Heterogeneous Computing
▪ Terminology:
▪ Host The CPU and its memory (host memory)
▪ Device The GPU and its memory (device memory)

Host Device
36
Heterogeneous Computing
#include <iostream>
#include <algorithm>

using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

__global__ void stencil_1d(int *in, int *out) {


__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;

// Read input elements into shared memory


temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}

// Synchronize (ensure all the data is available)
__syncthreads();

// Apply the stencil


int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];

// Store the result


out[gindex] = result;
}

void fill_ints(int *x, int n) {


fill_n(x, n, 1);
}

int main(void) {
int *in, *out; // host copies of a, b, c
int *d_in, *d_out; // device copies of a, b, c
int size = (N + 2*RADIUS) * sizeof(int);

// Alloc space for host copies and setup values


in = (int *)malloc(size); fill_ints(in, N + 2*RADIUS);
out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

// Alloc space for device copies


cudaMalloc((void **)&d_in, size);
cudaMalloc((void **)&d_out, size);

// Copy to device
cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

// Launch stencil_1d() kernel on GPU


stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS,
d_out + RADIUS);

// Copy result back to host
cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

// Cleanup
free(in); free(out);
cudaFree(d_in); cudaFree(d_out);
return 0;
}

37
Simple Processing Flow

PCI Bus

1. Copy input data from CPU memory to GPU memory

38
Simple Processing Flow

PCI Bus

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute,
caching data on chip for performance

39
Simple Processing Flow

PCI Bus

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute,
caching data on chip for performance
3. Copy results from GPU memory to
CPU memory

40
Hello World!
int main(void) {
    printf("Hello World!\n");
    return 0;
}

• Standard C that runs on the host
• NVIDIA compiler (nvcc) can be used to compile programs with no device code

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$

41
Hello World! with Device Code
__global__ void mykernel(void) {
}

int main(void) {
mykernel<<<1,1>>>();
printf("Hello World!\n");
return 0;
}

▪ Two new syntactic elements…

42
Hello World! with Device Code
__global__ void mykernel(void) {
}

• CUDA C/C++ keyword __global__ indicates a function that:


• Runs on the device
• Is called from host code

• nvcc separates source code into host and device components


• Device functions (e.g. mykernel()) processed by NVIDIA compiler
• Host functions (e.g. main()) processed by standard host compiler
• gcc, cl.exe

43
Hello World! with Device Code
mykernel<<<1,1>>>();

• Triple angle brackets mark a call from host code to device code
• Also called a “kernel launch”
• We’ll return to the parameters (1,1) in a moment

• That’s all that is required to execute a function on the GPU!

44
Hello World! with Device Code
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$

• mykernel() does nothing, somewhat anticlimactic!

45
Parallel Programming in CUDA C/C++
• But wait… GPU computing is about
massive parallelism!

• We need a more interesting example…

• We’ll start by adding two integers and build up to vector addition


46
Addition on the Device
• A simple kernel to add two integers

__global__ void add(int *a, int *b, int *c) {


*c = *a + *b;
}

• As before __global__ is a CUDA C/C++ keyword meaning


• add() will execute on the device
• add() will be called from the host

47
Addition on the Device
• Note that we use pointers for the variables

__global__ void add(int *a, int *b, int *c) {


*c = *a + *b;
}

• add() runs on the device, so a, b and c must point to device memory

• We need to allocate memory on the GPU

48
Memory Management
• Host and device memory are separate entities
• Device pointers point to GPU memory
May be passed to/from host code
May not be dereferenced in host code
• Host pointers point to CPU memory
May be passed to/from device code
May not be dereferenced in device code

• Simple CUDA API for handling device memory


• cudaMalloc(), cudaFree(), cudaMemcpy()
• Similar to the C equivalents malloc(), free(), memcpy()

49
Addition on the Device: add()
• Returning to our add() kernel

__global__ void add(int *a, int *b, int *c) {


*c = *a + *b;
}

• Let’s take a look at main()…

50
Addition on the Device: main()
int main(void) {
int a, b, c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = sizeof(int);

// Allocate space for device copies of a, b, c


cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Setup input values


a = 2;
b = 7;

51
Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU


add<<<1,1>>>(d_a, d_b, d_c);

// Copy result back to host


cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}

52
RUNNING IN PARALLEL

53
Moving to Parallel
• GPU computing is about massive parallelism
• So how do we run code in parallel on the device?

add<<< 1, 1 >>>();

add<<< N, 1 >>>();

• Instead of executing add() once, execute N times in parallel

54
Vector Addition on the Device
• With add() running in parallel we can do vector addition

• Terminology: each parallel invocation of add() is a block


• The set of blocks is referred to as a grid
• Each invocation can refer to its block index using blockIdx.x

__global__ void add(int *a, int *b, int *c) {


c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index into the array, each block handles a different index

55
Vector Addition on the Device
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• On the device, each block can execute in parallel:

Block 0 Block 1 Block 2 Block 3


c[0] = a[0] + b[0]; c[1] = a[1] + b[1]; c[2] = a[2] + b[2]; c[3] = a[3] + b[3];

56
Vector Addition on the Device: add()
• Returning to our parallelized add() kernel

__global__ void add(int *a, int *b, int *c) {


c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• Let’s take a look at main()…

57
Vector Addition on the Device: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c


cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values


a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

58
Vector Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N blocks


add<<<N,1>>>(d_a, d_b, d_c);

// Copy result back to host


cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}

59
Review
• Difference between host and device
  • Host: CPU
  • Device: GPU
• __global__ declares device code
  • Executes on the device
  • Called from the host
• Passing parameters from host code to a device function
• Basic device memory management
  • cudaMalloc()
  • cudaMemcpy()
  • cudaFree()
• Launching parallel kernels
  • Launch N copies of add() with add<<<N,1>>>(…);
  • Use blockIdx.x to access block index

60
INTRODUCING THREADS

61
CUDA Threads
• Terminology: a block can be split into parallel threads

• Change add()to use parallel threads instead of parallel blocks:


__global__ void add(int *a, int *b, int *c) {
c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
• Use threadIdx.x instead of blockIdx.x

• Need to make one change in main()…


62
Vector Addition Using Threads: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c


cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values


a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
63
Vector Addition Using Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads


add<<<1,N>>>(d_a, d_b, d_c);

// Copy result back to host


cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
64
COMBINING THREADS AND BLOCKS

65
Combining Blocks and Threads
• We’ve seen parallel vector addition using:
• Many blocks with one thread each
• One block with many threads

• Let’s adapt vector addition to use both blocks and threads

• Why? We’ll come to that…

• First let’s discuss data indexing…


66
Indexing Arrays with Blocks and Threads
• No longer as simple as using blockIdx.x and threadIdx.x
• Index an array with one elem. per thread (8 threads/block)

threadIdx.x threadIdx.x threadIdx.x threadIdx.x

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

blockIdx.x = 0 blockIdx.x = 1 blockIdx.x = 2 blockIdx.x = 3

• With M threads/block, unique index per thread is :


int index = threadIdx.x + blockIdx.x * M;

67
Indexing Arrays: Example
• Which thread will operate on the red element?

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

M = 8 threadIdx.x = 5

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

blockIdx.x = 2

int index = threadIdx.x + blockIdx.x * M;


= 5 + 2 * 8;
= 21;
68
Vector Addition with Blocks and Threads
• Use the built-in variable blockDim.x for threads per block
int index = threadIdx.x + blockIdx.x * blockDim.x;

• Combined add() using parallel threads and blocks


__global__ void add(int *a, int *b, int *c) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
c[index] = a[index] + b[index];
}

• What changes need to be made in main()?

69
Addition with Blocks and Threads:
main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c


cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values


a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

70
Addition with Blocks and Threads:
main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU


add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

// Copy result back to host


cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}

71
Handling Arbitrary Vector Sizes
• Typical problems are not friendly multiples of
blockDim.x

• Avoid accessing beyond the end of the arrays:


__global__ void add(int *a, int *b, int *c, int n) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] + b[index];
}

• Update the kernel launch:


add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);

72
Why Bother with Threads?
• Threads seem unnecessary
• They add a level of complexity
• What do we gain?

• Unlike parallel blocks, threads have mechanisms to:


• Communicate
• Synchronize

• To look closer, we need a new example…

73
COOPERATING THREADS

75
Stencils
• Each pixel → function of neighbors
• Edge detection:

• Blur:

76
1D Stencil
• Consider 1D stencil over 1D array of elements
• Each output element is the sum of input elements within a radius

• Radius == 3 → each output element is sum of 7 input elements:

radius radius

77
Implementation within a block
• Each thread: process 1 output element
• blockDim.x elements per block
• Input elements read many times
  • With radius 3, each input element is read seven times

__global__ void stencil_1d(int *in, int *out) {
    // note: idx computation & edge conditions omitted…
    int result = 0;
    for (int offset = -R; offset <= R; offset++)
        result += in[idx + offset];

    // Store the result
    out[idx] = result;
}

78
Implementation within a block
• Each thread: process 1 output element
• blockDim.x elements per block
• Input elements read many times
  • With radius 3, each input element is read seven times

__global__ void stencil_1d(int *in, int *out) {
    // note: idx computation & edge conditions omitted…
    int result = 0;
    for (int offset = -R; offset <= R; offset++)
        result += in[idx + offset];

    // Store the result
    out[idx] = result;
}

Why is this a problem?

79
Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory

• Extremely fast on-chip memory, user-managed

• Declare using __shared__, allocated per block

• Data is not visible to threads in other blocks

80
Stencil with Shared Memory
• Cache data in shared memory
– Read (blockDim.x + 2 * radius) elements from memory to shared
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory

– Each block needs a halo of radius elements at each boundary

(halo on left | blockDim.x output elements | halo on right)

81


Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;

// Read input elements into shared memory


temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] =
in[gindex + BLOCK_SIZE];
}
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];

// Store the result
out[gindex] = result;
}

Are we done?

82
Data Race!
▪ The stencil example will not work…

▪ Suppose thread 15 reads the halo before thread 0 has fetched it…

temp[lindex] = in[gindex];                            // Store at temp[18]
if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];      // Skipped: threadIdx.x > RADIUS
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}

int result = 0;
result += temp[lindex + 1];                           // Load from temp[19]

83
__syncthreads()
• void __syncthreads();

• Synchronizes all threads within a block


– Used to prevent RAW / WAR / WAW hazards

• All threads must reach the barrier


– In conditional code, the condition must be uniform across the block

84
Correct Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;

// Read input elements into shared memory


temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] =
in[gindex + BLOCK_SIZE];
}
__syncthreads();
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
    result += temp[lindex + offset];

// Store the result
out[gindex] = result;
}

85
Notes on __syncthreads()
• void __syncthreads();
• Synchronizes all threads within a block
  – Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
  – In conditional code, the condition must be uniform across the block

__global__ void some_kernel(int *in, int *out) {
    // good idea?
    if (threadIdx.x == SOME_VALUE)
        __syncthreads();
}

__device__ void lock_trick(int *in, int *out) {
    __syncthreads();
    if (myIndex == 0)
        critical_section();
    __syncthreads();
}

86
Atomics
• Race conditions: traditional locks are to be avoided
• How do we synchronize?
• Read-Modify-Write made uninterruptible: atomics

atomicAdd()   atomicInc()
atomicSub()   atomicDec()
atomicMin()   atomicExch()
atomicMax()   atomicCAS()
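Echoing the earlier quiz question about implementing locks on a GPU, a hedged sketch of a spinlock built from atomicCAS()/atomicExch(); it is only safe when contention is restricted (e.g. one thread per block), since threads in the same warp cannot spin on each other on older architectures:

__device__ void acquire_lock(int *lock) {
    while (atomicCAS(lock, 0, 1) != 0)       // spin until we flip 0 -> 1
        ;
    __threadfence();                         // see writes made before the last release
}

__device__ void release_lock(int *lock) {
    __threadfence();                         // publish the critical section's writes
    atomicExch(lock, 0);
}

__global__ void increment_with_lock(int *lock, int *counter) {
    if (threadIdx.x == 0) {                  // one contender per block
        acquire_lock(lock);
        *counter = *counter + 1;             // critical section
        release_lock(lock);
    }
}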

87
Recap
• Launching parallel threads
• Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
• Use blockIdx.x to access block index within grid
• Use threadIdx.x to access thread index within block
• Allocate elements to threads:

int index = threadIdx.x + blockIdx.x * blockDim.x;

• Use __shared__ to declare a variable/array in shared memory
  • Data is shared between threads in a block
  • Not visible to threads in other blocks

• Use __syncthreads() as a barrier
  • Use to prevent data hazards

88
MANAGING THE DEVICE

89
Coordinating Host & Device
• Kernel launches are asynchronous
• Control returns to the CPU immediately

• CPU needs to synchronize before consuming the results

cudaMemcpy()              Blocks the CPU until the copy is complete;
                          the copy begins when all preceding CUDA calls have completed
cudaMemcpyAsync()         Asynchronous, does not block the CPU
cudaDeviceSynchronize()   Blocks the CPU until all preceding CUDA calls have completed
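A hedged sketch of the usual pattern (my_kernel and the buffer names are illustrative):

__global__ void my_kernel(const int *in, int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];                            // placeholder work
}

void run_and_wait(int *d_in, int *d_out, int *h_out, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    my_kernel<<<blocks, threads>>>(d_in, d_out, n);       // returns to the CPU immediately

    // cudaMemcpy waits for preceding CUDA calls, then blocks until the copy completes
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // explicit barrier, useful before timing or error checks
    cudaDeviceSynchronize();
}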
90
Reporting Errors
• All CUDA API calls return an error code (cudaError_t)
• Error in the API call itself
OR
• Error in an earlier asynchronous operation (e.g. kernel)

• Get the error code for the last error:


cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
const char *cudaGetErrorString(cudaError_t)

printf("%s\n", cudaGetErrorString(cudaGetLastError()));
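A common convenience (a hedged sketch, not part of the CUDA API itself) is a checking macro wrapped around every call:

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// usage:
//   CUDA_CHECK(cudaMalloc((void **)&d_a, size));
//   mykernel<<<1,1>>>();
//   CUDA_CHECK(cudaGetLastError());    // catches launch/configuration errors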

91
Device Management
• Application can query and select GPUs
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)

• Multiple threads can share a device

• A single thread can manage multiple devices


cudaSetDevice(i) to select current device
cudaMemcpy(…) for peer-to-peer copies✝

✝ requires OS and device support
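A hedged sketch of enumerating devices with these calls and printing a couple of properties:

#include <cstdio>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %d SMs, %zu bytes of global memory\n",
               i, prop.name, prop.multiProcessorCount, prop.totalGlobalMem);
    }
    cudaSetDevice(0);    // select the device used by subsequent CUDA calls
    return 0;
}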


92
Questions?

93
