Parallel Algorithms
CUDA
Chris Rossbach
cs378h
Outline for Today
• Questions?
• Administrivia
• Eldar-* machines should be available
• Agenda
• Parallel Algorithms
• CUDA
• Acknowledgements:
http://developer.download.nvidia.com/compute/developertrainingmaterials/presentations/cuda_language/Introduction_to_CUDA_C.pptx
2
Faux Quiz Questions
• What is a reduction? A prefix sum? Why are they hard to parallelize and what basic techniques
can be used to parallelize them?
• Define flow dependence, output dependence, and anti-dependence: give an example of each.
Why/how do compilers use them to detect loop-independent vs loop-carried dependences?
• What is the difference between a thread-block and a warp?
• How/Why must programmers copy data back and forth to a GPU?
• What is “shared memory” in CUDA? Describe a setting in which it might be useful.
• CUDA kernels have implicit barrier synchronization. Why is __syncthreads() necessary in light of
this fact?
• How might one implement locks on a GPU?
• What ordering guarantees does a GPU provide across different hardware threads’ access to a
single memory location? To two disjoint locations?
• When is it safe for one GPU thread to wait (e.g. by spinning) for another?
3
Review: what is a vector processor?
Don't decode the same instruction over and over…
Implementation:
• Instruction fetch/control logic shared
• Same instruction stream executed on multiple pipelines
• Multiple different operands processed in parallel
4
When does vector processing help?
Pros? Cons?
Fine-grained multithreading
• Threads interleave instructions
• Round-robin
• Skip stalled threads
• Hardware support required
• Separate PC and register file per thread
• Hardware to control alternating pattern
• Naturally hides delays
• Data hazards, Cache misses
• Pipeline runs with rare stalls
Pros? Cons?
Simultaneous Multithreading (SMT)
• Instructions from multiple threads
issued on same cycle
• Uses register renaming
• Dynamic scheduling facility of a multi-issue architecture
• Hardware support:
• Register files, PCs per thread
• Temporary result registers pre-commit
• Support to sort out which threads get
results from which instructions
Pros? Cons?
Why Vector and Multithreading Background?
GPU:
• A very wide vector machine
• Massively multi-threaded to hide memory latency
• Originally designed for graphics pipelines…
Graphics ~= Rendering
Inputs
• 3D world model(objects, materials)
• Geometry modeled with triangle meshes, surface normals
• GPUs subdivide triangles into “fragments” (rasterization)
• Materials modeled with “textures”
• Texture coordinates and sampling "map" textures → geometry
• Light locations and properties
• Attempt to model surface/light interactions with modeled objects/materials
• View point
Output
• 2D projection seen from the view-point
12
Grossly over-simplified rendering algorithm
foreach (vertex v in model)
    map v_model → v_view
fragment[] frags = {};
foreach (triangle t = (v0, v1, v2))
    frags.add(rasterize(t));
foreach (fragment f in frags)
    choose_color(f);
display(visible_fragments(frags));
13
Algorithm → Graphics Pipeline
• The loops above map onto the stages of the OpenGL pipeline
To first order, DirectX looks the same!
14
Graphics pipeline → GPU architecture
20
Programming Model
• GPUs are I/O devices, managed by user-code
• “kernels” == “shader programs”
• 1000s of HW-scheduled threads per kernel
• Threads grouped into independent blocks.
• Threads in a block can synchronize (barrier)
• This is the *only* synchronization
• “Grid” == “launch” == “invocation” of a kernel
• a group of blocks (or warps)
Need code with 1000s-way parallelism….
21
Parallel Algorithms
• Sequential algorithms often do not permit easy parallelization
• Does not mean the work has no parallelism
• A different approach can yield parallelism
• but often changes the algorithm
• Parallelizing != just adding locks to a sequential algorithm
• Parallel Patterns
• Map
• Scatter, Gather
• Reduction
• Scan
• Search, Sort
If you can express your algorithm using these patterns, an apparently fundamentally sequential algorithm can be made parallel.
Map
• Inputs
• Array A
• Function f(x)
• map(A, f) → apply f(x) on all elements in A
• Parallelism trivially exposed
• f(x) can be applied in parallel to all elements, in principle (see the sketch below)
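A minimal CUDA sketch of map; f here is an illustrative squaring function and n is the array length (both names are assumptions, not from the slides):

__device__ int f(int x) { return x * x; }      // illustrative element-wise function

__global__ void map_kernel(const int *A, int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = f(A[i]);                      // each element is processed independently
}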
• Gather:
• Read multiple items into a single/packed location
• Scatter:
• Write single/packed data item to multiple locations
• Inputs: x, y, indices, N (see the sketch below)
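Hedged sketches of both patterns using the inputs named above (x, y, indices, N); note that a scatter is only well-defined when the indices do not collide:

__global__ void gather(const int *x, int *y, const int *indices, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) y[i] = x[indices[i]];           // indirect read
}

__global__ void scatter(const int *x, int *y, const int *indices, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) y[indices[i]] = x[i];           // indirect write
}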
GroupBy
Input:   10 30 20 10 20 30 10
Grouped: 10 10 10 30 30 20 20

foreach (T elem in ints) {          // also shown over PF(ints)
    key = KeyLambda(elem);
    group = GetGroup(key);
    group.Add(elem);
}
29
GroupBy using parallel primitives
Input: 10 30 20 10 20 30 10
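One way to express GroupBy with the primitives above (a sketch, not the only formulation): sort the elements by key, then detect group boundaries in parallel; a scan over the boundary flags assigns each element its group number.

__global__ void mark_group_starts(const int *keys, int *flags, int n) {
    // keys[] is assumed to be sorted already (e.g. by a parallel sort-by-key)
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        flags[i] = (i == 0 || keys[i] != keys[i - 1]) ? 1 : 0;   // 1 at the start of every group
}
// An inclusive scan over flags then yields each element's group index,
// so the sequential loop above becomes sort + map + scan.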
30
Sort
• OK, let’s build a parallel sort
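As a sketch under simplifying assumptions (the whole array fits in one thread block and in shared memory), odd-even transposition sort exposes independent compare-swaps in each phase; __syncthreads() separates the phases:

#define SORT_MAX 256                            // illustrative limit: one block's worth of elements

__global__ void odd_even_sort(int *data, int n) {
    __shared__ int s[SORT_MAX];
    int tid = threadIdx.x;
    if (tid < n) s[tid] = data[tid];
    __syncthreads();

    for (int phase = 0; phase < n; ++phase) {
        int i = 2 * tid + (phase & 1);          // even phases: pairs (0,1),(2,3)…; odd phases: (1,2),(3,4)…
        if (i + 1 < n && s[i] > s[i + 1]) {     // compare-swap adjacent elements
            int t = s[i]; s[i] = s[i + 1]; s[i + 1] = t;
        }
        __syncthreads();                        // finish this phase before starting the next
    }
    if (tid < n) data[tid] = s[tid];
}
// Launch (n <= SORT_MAX):  odd_even_sort<<<1, SORT_MAX>>>(d_data, n);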
31
Summary
Re-expressing apparently sequential algorithms as combinations of
parallel patterns is a common technique when targeting GPUs
• Reductions
• Scans
• Re-orderings (scatter/gather)
• Sort
• Map
32
What is CUDA?
• CUDA Architecture
• Expose GPU parallelism for general-purpose computing
• Retain performance
• CUDA C/C++
• Based on industry-standard C/C++
• Small set of extensions to enable heterogeneous programming
• Straightforward APIs to manage devices, memory etc.
33
CONCEPTS:
• Heterogeneous Computing
• Blocks
• Threads
• Indexing
• Shared memory
• __syncthreads()
• Asynchronous operation
• Handling errors
• Managing devices
34
HELLO WORLD!
Heterogeneous Computing
▪ Terminology:
▪ Host: the CPU and its memory (host memory)
▪ Device: the GPU and its memory (device memory)
36
Heterogeneous Computing
#include <iostream>
#include <algorithm>

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn: the device kernel (body largely elided on this slide; see the stencil kernel later)
__global__ void stencil_1d(int *in, int *out) {
    // …
    // Synchronize (ensure all the data is available)
    __syncthreads();
    // …
}

int main(void) {
    int *in, *out;        // host copies
    int *d_in, *d_out;    // device copies
    int size = (N + 2*RADIUS) * sizeof(int);

    // serial code: allocate host and device memory (host allocation elided on this slide)
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // parallel code: kernel launch (elided on this slide), then copy the result back
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // serial code: cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
37
Simple Processing Flow
[Diagram slides 38-40: data moves across the PCI bus. Copy input data from CPU memory to GPU memory, execute the GPU program, then copy results back from GPU memory to CPU memory.]
Hello World!
int main(void) {
printf("Hello World!\n");
return 0;
}
• Standard C that runs on the host
• The NVIDIA compiler (nvcc) can be used to compile programs with no device code

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$
41
Hello World! with Device Code
__global__ void mykernel(void) {
}
int main(void) {
mykernel<<<1,1>>>();
printf("Hello World!\n");
return 0;
}
42
Hello World! with Device Code
__global__ void mykernel(void) {
}
43
Hello World! with Device Code
mykernel<<<1,1>>>();
• Triple angle brackets mark a call from host code to device code
• Also called a “kernel launch”
• We’ll return to the parameters (1,1) in a moment
44
Hello World! with Device Code
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$

• mykernel() does nothing; somewhat anticlimactic!
45
Parallel Programming in CUDA C/C++
• But wait… GPU computing is about massive parallelism!
• We need a more interesting example: start by adding two integers, then build up to vector addition a + b = c
46
Addition on the Device
• A simple kernel to add two integers
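The kernel itself did not survive extraction here; a minimal sketch consistent with the next slide (pointer arguments, operands resident in device memory):

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;        // runs on the device, so a, b, c must point to device memory
}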
47
Addition on the Device
• Note that we use pointers for the variables
48
Memory Management
• Host and device memory are separate entities
• Device pointers point to GPU memory
May be passed to/from host code
May not be dereferenced in host code
• Host pointers point to CPU memory
May be passed to/from device code
May not be dereferenced in device code
49
Addition on the Device: add()
• Returning to our add() kernel
50
Addition on the Device: main()
int main(void) {
int a, b, c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = sizeof(int);
51
Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
// Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
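The two fragments above omit the device allocations, the kernel launch, and the copy back; a complete main() along these lines (initial values are illustrative):

int main(void) {
    int a = 2, b = 7, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;             // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on the device
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}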
52
RUNNING IN PARALLEL
53
Moving to Parallel
• GPU computing is about massive parallelism
• So how do we run code in parallel on the device?
add<<< 1, 1 >>>();        // executes add() once, on one thread
add<<< N, 1 >>>();        // executes add() N times in parallel, one copy per block
54
Vector Addition on the Device
• With add() running in parallel we can do vector addition
55
Vector Addition on the Device
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
56
Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
57
Vector Addition on the Device: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
58
Vector Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
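Again the middle of main() is missing from the extraction; a complete sketch, with N and the block-indexed add() as defined above and the input values chosen arbitrarily:

int main(void) {
    int *a, *b, *c;                   // host copies
    int *d_a, *d_b, *d_c;             // device copies
    int size = N * sizeof(int);

    // Allocate device memory
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Allocate and initialize host memory
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() on the device: N blocks of 1 thread each
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}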
59
Review
• Difference between host and device
  • Host = CPU, Device = GPU
• __global__ declares device code
  • Executes on the device, called from the host
• Passing parameters from host code to a device function
• Basic device memory management
  • cudaMalloc(), cudaMemcpy(), cudaFree()
• Launching parallel kernels
  • Launch N copies of add() with add<<<N,1>>>(…);
  • Use blockIdx.x to access block index
60
INTRODUCING THREADS
61
CUDA Threads
• Terminology: a block can be split into parallel threads
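A sketch of the threads-based variant: the same vector add, but indexed with threadIdx.x and launched as one block of N threads.

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// Launched from the host as one block of N threads:
//   add<<<1,N>>>(d_a, d_b, d_c);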
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
64
COMBINING THREADS AND BLOCKS
65
Combining Blocks and Threads
• We’ve seen parallel vector addition using:
• Many blocks with one thread each
• One block with many threads
threadIdx.x within each of four blocks: 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
67
Indexing Arrays: Example
• Which thread will operate on the red element?
Global array index:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
With M = 8 threads per block, threadIdx.x within each block:
0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
Answer: the thread with threadIdx.x = 5 in the block with blockIdx.x = 2
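Its global index follows from combining the two: index = threadIdx.x + blockIdx.x * M = 5 + 2 * 8 = 21. In a kernel this is typically written with blockDim.x standing in for M:

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;   // blockDim.x == M == 8 in this example
    c[index] = a[index] + b[index];                      // assumes the array length is a multiple of blockDim.x
}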
69
Addition with Blocks and Threads: main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
70
Addition with Blocks and Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
71
Handling Arbitrary Vector Sizes
• Typical problems are not friendly multiples of blockDim.x (see the sketch below)
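The usual fix, sketched here: pass n to the kernel, guard against out-of-range indices, and round the block count up when launching.

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                    // threads past the end of the array do nothing
        c[index] = a[index] + b[index];
}

// Host side: launch enough blocks to cover all n elements
//   add<<<(n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);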
72
Why Bother with Threads?
• Threads seem unnecessary
• They add a level of complexity
• What do we gain? Unlike blocks, threads within a block can communicate through shared memory and synchronize, which is the topic of the next section
73
COOPERATING THREADS
75
Stencils
• Each pixel → function of neighbors
• Edge detection:
• Blur:
76
1D Stencil
• Consider 1D stencil over 1D array of elements
• Each output element is the sum of all input elements within a radius of it (with radius 3, that is 7 inputs per output)
77
Implementation within a block
• Each thread processes one output element
• blockDim.x elements per block
• Input elements are read many times
• With radius 3, each input element is read seven times

__global__ void stencil_1d(int *in, int *out) {
    // note: idx computation & edge conditions omitted…
    int result = 0;
    for (int offset = -R; offset <= R; offset++)
        result += in[idx + offset];

    // Store the result
    out[idx] = result;
}
78
Implementation within a block
• Same kernel as above: with radius 3, each input element is read from global memory seven times
• Why is this a problem?
79
Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory
80
Stencil with Shared Memory
• Cache data in shared memory
– Read (blockDim.x + 2 * radius) elements from memory to shared
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory
▪ Data race: suppose thread 15 reads the halo before thread 0 has fetched it…
    int result = 0;
    result += temp[lindex + 1];    // thread 15 reads temp[19], which thread 0 may not have written yet
83
__syncthreads()
• void __syncthreads(); is a barrier: every thread in the block must reach it before any thread proceeds
84
Correct Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
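    // (The rest of the kernel is cut off above; what follows is a sketch of a likely
    //  continuation, consistent with the shared-memory scheme just described.)

    // Read input elements into shared memory, including the RADIUS-wide halo on each side
    // (assumes the input was allocated with a halo, as in the (N + 2*RADIUS) sizing earlier)
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize: every element (and halo) must be loaded before any thread reads it
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}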
87
Recap
• Launching parallel threads
• Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
• Use blockIdx.x to access block index within grid
• Use threadIdx.x to access thread index within block
• Allocate elements to threads: int index = threadIdx.x + blockIdx.x * blockDim.x;
MANAGING THE DEVICE
89
Coordinating Host & Device
• Kernel launches are asynchronous
• Control returns to the CPU immediately
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
91
Device Management
• Application can query and select GPUs
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
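A small usage sketch of these calls: enumerate the devices, print a property or two, then select one.

int count = 0;
cudaGetDeviceCount(&count);
for (int d = 0; d < count; d++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, d);
    printf("Device %d: %s, %d SMs\n", d, prop.name, prop.multiProcessorCount);
}
cudaSetDevice(0);      // subsequent CUDA calls in this host thread target device 0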
93