CUDA Introduction
Ø What is GPGPU?
l General-Purpose computation on a Graphics
Processing Unit
l Using graphics hardware for non-graphics
computations
Ø What is CUDA?
l Parallel computing platform and API by Nvidia
for general-purpose GPU programming
l Introduced in 2007; still actively updated
Motivation
CPU vs. GPU
Ø CPU
l Fast caches
l Branching adaptability
l High single-thread performance
Ø GPU
l Multiple ALUs
l Fast onboard memory
l High throughput on parallel tasks
• Executes program on each fragment/vertex
CPU vs. GPU - Hardware
Traditional Graphics Pipeline
Vertex processing
↓
Rasterizer
↓
Fragment processing
↓
Renderer (textures)
Pixel / Thread Processing
GPU Architecture
Processing Element
GPU Memory Architecture
Uncached:
Ø Registers
Ø Shared Memory
Ø Local Memory
Ø Global Memory
Cached:
Ø Constant Memory
Ø Texture Memory
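As an illustration (not from the deck), here is where these spaces appear in CUDA source; local memory is compiler-managed spill space and texture memory is accessed through a separate API, so neither is declared explicitly, and all names here are hypothetical:

/* Constant memory: cached, read-only from kernel code */
__constant__ float coeffs[16];

/* Assumes a launch with 128 threads per block */
__global__ void memorySpaces(const float *global_in, float *global_out)
{
    /* Registers: ordinary per-thread scalars normally live here */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    /* Shared memory: fast on-chip storage shared by the block */
    __shared__ float tile[128];

    tile[threadIdx.x] = global_in[i];   /* global memory read */
    __syncthreads();                    /* wait for the whole block */
    global_out[i] = tile[threadIdx.x] * coeffs[0];   /* global memory write */
}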
Data-parallel Programming
Ø Think of the GPU as a massively-threaded
co-processor
Ø Write “kernel” functions that execute on
the device, processing multiple data
elements in parallel (see the sketch below)
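A preview sketch (not part of the original slide): a kernel is an ordinary C function marked __global__, and each of the many threads running it handles one data element. The function and names are illustrative:

/* Each thread doubles one element; the index comes from built-in
   thread/block coordinates introduced later in this deck */
__global__ void doubleElements(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 2.0f * data[i];
}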
Hardware Requirements
Ø CUDA-capable
video card
Ø Power supply
Ø Cooling
Ø PCI-Express
A Gentle Introduction to
CUDA Programming
Credits
Ø The code used in this presentation is based
on code available in:
l the Tutorial on CUDA in Dr. Dobb's Journal
l Andrew Bellenir’s code for matrix multiplication
l Igor Majdandzic’s code for Voronoi diagrams
l NVIDIA’s CUDA programming guide
Software Requirements/Tools
Ø CUDA Toolkit (includes the nvcc compiler)
Ø CUDA-capable Nvidia driver
Profiling:
Ø Occupancy calculator
Ø Visual profiler
To compute, we need to:
Ø Allocate memory for the computation
on the GPU (incl. variables)
Ø Provide input data
Ø Specify the computation to be performed
Ø Read the results from the GPU (output)
Initially:
[figure: array in host memory only]
Allocate Memory in the GPU card
[figure: array on the host; array_d allocated on the GPU]
Copy content from the host's memory to the GPU card memory
[figure: array copied into array_d]
Execute code on the GPU
[figure: the GPU MPs operate on array_d]
Copy results back to the host memory
[figure: array_d copied back into array]
The Kernel
Ø The code to be executed in the
stream processors on the GPU
Ø Simultaneous execution in
several (perhaps all) stream
processors on the GPU
Grid and Block Size
Let’s look at a very simple example
Ø The code has been divided into two files:
l simple.c
l simple.cu
Ø simple.c is ordinary code in C
Ø It allocates an array of integers, initializes
it to zeros, and prints the array.
Ø It calls a function that modifies the array
Ø The array is printed again.
simple.c
#include <stdio.h>
#define SIZEOFARRAY 64
extern void fillArray(int *a, int size);
/* The main program */
int main(int argc, char *argv[])
{
    /* Declare the array that will be modified by the GPU */
    int a[SIZEOFARRAY];
    int i;
    /* Initialize the array to 0s */
    for (i = 0; i < SIZEOFARRAY; i++) {
        a[i] = 0;
    }
    /* Print the initial array */
    printf("Initial state of the array:\n");
    for (i = 0; i < SIZEOFARRAY; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
    /* Call the function that will in turn call the function in the GPU that will fill the array */
    fillArray(a, SIZEOFARRAY);
    /* Now print the array after calling fillArray */
    printf("Final state of the array:\n");
    for (i = 0; i < SIZEOFARRAY; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
    return 0;
}
simple.cu
Ø simple.cu contains two functions
l fillArray(): A function that will be executed on
the host and which takes care of:
• Allocating variables in the global GPU memory
• Copying the array from the host to the GPU memory
• Setting the grid and block sizes
• Invoking the kernel that is executed on the GPU
• Copying the values back to the host memory
• Freeing the GPU memory
fillArray (part 1)
#define BLOCK_SIZE 32
extern "C" void fillArray(int *array, int arraySize)
{
    int *array_d;
    cudaError_t result;
fillArray (part 2)
    /* Indicate block size */
    dim3 dimblock(BLOCK_SIZE);
    /* Indicate grid size */
    dim3 dimgrid(arraySize / BLOCK_SIZE);
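The deck stops the listing here; the remaining steps it enumerates for fillArray (allocate, copy in, launch, copy out, free) would look roughly like this sketch, following the Dr. Dobb's tutorial pattern cited in the credits:

    /* Allocate memory on the GPU for the array */
    result = cudaMalloc((void **)&array_d, sizeof(int) * arraySize);
    /* Copy the array from host memory to GPU memory */
    result = cudaMemcpy(array_d, array, sizeof(int) * arraySize,
                        cudaMemcpyHostToDevice);
    /* Invoke the kernel (shown on the next slide) on the GPU */
    cu_fillArray<<<dimgrid, dimblock>>>(array_d);
    /* Copy the values back to host memory */
    result = cudaMemcpy(array, array_d, sizeof(int) * arraySize,
                        cudaMemcpyDeviceToHost);
    /* Free the GPU memory */
    cudaFree(array_d);
}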
simple.cu (cont.)
Ø The other function in simple.cu is cu_fillArray():
l Built-in variables:
• blockIdx.x : block index within the grid
• threadIdx.x: thread index within the block
cu_fillArray
__global__ void cu_fillArray(int * array_d)
{
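    /* Compute this thread's global index, then store that index
       into the corresponding array element */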
    int x;
    x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    array_d[x] = x;
}
To compile:
Ø nvcc simple.c simple.cu -o simple
Ø The compiler generates the code for both
the host and the GPU
Ø Demo on cuda.littlefe.net …
In the GPU:
[figure: array elements mapped onto processing elements, grouped into Block 0 and Block 1]
Another Example: saxpy
Ø SAXPY (Scalar Alpha X Plus Y)
l A common operation in linear algebra
Ø CUDA: loop iteration → thread
Traditional Sequential Code
void saxpy_serial(int n, float alpha, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
CUDA Code
__global__ void saxpy_parallel(int n, float alpha, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}
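The deck does not show the launch; a minimal sketch, where x_d and y_d are hypothetical device pointers already filled via cudaMemcpy:

/* Round the block count up so all n elements are covered; the
   if (i < n) guard in the kernel discards the excess threads */
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x_d, y_d);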
“Warps”
Ø Each block is split into SIMD groups of threads
called "warps"; a warp is 32 threads on Nvidia GPUs.
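A sketch (not from the deck) of why warps matter: threads in a warp execute in lockstep, so a branch that splits a warp is serialized, while a branch that is uniform across each warp is not. The kernel names are hypothetical:

__global__ void divergent(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)          /* even/odd threads share a warp: divergent */
        out[i] = 2 * i;
    else
        out[i] = -i;
}

__global__ void uniform(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)   /* whole warps take the same branch */
        out[i] = 2 * i;
    else
        out[i] = -i;
}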
[figure: Blocks 1-4, each split into warps 1-3 of threads, assigned to Multi-processor 1]
Keeping multiprocessors in mind…
Ø Each multiprocessor can process multiple blocks at a
time.
Performance Tip: Block Size
Ø Use a multiple of the warp size (32 threads); common
choices are 128-256 threads per block
Performance Tip:
Grid Size (number of blocks)
Ø Recommended value is at least 100 blocks; around 1000 blocks
scales well across many generations of hardware
Ø Example: 24 blocks
l A 24-block grid runs efficiently on Series 8 hardware (12 MPs), but
it wastes resources on newer GPUs with 32 MPs
Example: Tesla P100
Ø Launched in 2016
Ø “Pascal” architecture (successors: Volta, Turing)
Ø Double-precision performance: 4.7 TeraFLOPS
Ø Single-precision performance: 9.3 TeraFLOPS
Ø GPU Memory: 16 GB
Example: Tesla P100
Ø Number of multiprocessors (MPs): 56
Ø Number of CUDA cores per MP: 64
Ø Total number of CUDA cores: 3584
Ø Number of CUDA cores = number of floating-point
instructions that can be issued per cycle
Ø MPs can run multiple threads per core
simultaneously (similar to hyperthreading on a CPU)
Ø Hence, the number of threads can exceed the number of cores
Memory Alignment
Ø Memory accesses are faster if data is aligned at
64-byte boundaries
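A small illustration (not from the deck): CUDA's __align__ qualifier forces a struct onto an aligned boundary so a thread can fetch it in one wide transaction; the struct is hypothetical:

/* Without __align__, this 12-byte struct could straddle alignment
   boundaries and cost extra memory transactions per thread */
struct __align__(16) float3_padded {
    float x, y, z;   /* padded to 16 bytes by the qualifier */
};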
Allocating 2D arrays with “pitch”
Ø CUDA offers special versions of the allocation and
copy functions for pitched 2D data:
l cudaMallocPitch()
l cudaMemcpy2D()
[figure: rows and columns of a 2D array, with padding at the end of each row; the pitch is the padded row width]
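A minimal sketch (not from the deck) of allocating and filling a pitched 2D array; width, height, and host_ptr are hypothetical:

float *dev_ptr;
size_t pitch;   /* bytes per padded row, chosen by CUDA */

/* Allocate height rows of width floats, each row padded for alignment */
cudaMallocPitch((void **)&dev_ptr, &pitch, width * sizeof(float), height);

/* Copy a tightly packed host array; cudaMemcpy2D handles the padding */
cudaMemcpy2D(dev_ptr, pitch, host_ptr, width * sizeof(float),
             width * sizeof(float), height, cudaMemcpyHostToDevice);

/* In a kernel, row r starts at (float *)((char *)dev_ptr + r * pitch) */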
Dividing the work by blocks:
[figure: the rows of the pitched array divided among Block 0, Block 1, and Block 2]
Watchdog timer
Ø The OS may force programs using the GPU to
time out if they run for too long
Resources online
Ø http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532
Ø http://www.ddj.com/hpc-high-performance-computing/207200659
Ø http://www.nvidia.com/object/cuda_home.html#
Ø http://www.nvidia.com/object/cuda_learn.html
Ø "Computation of Voronoi diagrams using a graphics
processing unit" by Igor Majdandzic et al., available
through the IEEE Digital Library, DOI: 10.1109/EIT.2008.4554342