CUDA Programming Model
These notes will introduce:
• Basic GPU programming model
• CUDA kernels
• A simple CUDA program to add two vectors together
• Compiling the code on a Linux system
SIMD (Single Instruction Multiple Data) model

Also known as data-parallel computation: one instruction specifies the operation, and it is applied across many data elements at once.

[Figure: a single instruction, e.g. a[] = a[] + k, broadcast to multiple ALUs, each operating on a different element of a[]]
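As a sketch, the same idea expressed as a CUDA kernel (the kernel name addK and the names N and dev_a are illustrative; the <<< >>> launch syntax is explained later in these notes):

__global__ void addK(int *a, int k) {
    int i = threadIdx.x;    // each thread applies the same instruction
    a[i] = a[i] + k;        // ... to a different element of a[]
}

addK<<< 1, N >>>(dev_a, k);    // N threads, one per element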
Programming applications using the SIMT model

Matrix operations -- very amenable to SIMT
• Same operations done on different elements of the matrices

Data manipulations
• Some sorting can be done quite efficiently
…
CUDA kernel routine

To write a SIMT program, one needs to write a code sequence that all the threads on the GPU will execute.

After the kernel runs on the GPU, results in GPU memory must be explicitly transferred (copied) back to CPU memory.
Basic CUDA program structure

int main(int argc, char **argv) {
    return 0;
}
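The body of main() carries out the steps covered next; as a comment-only outline:

int main(int argc, char **argv) {
    // 1. Allocate memory space on the device (GPU) for the data
    // 2. Allocate memory space on the host (CPU) for the data
    // 3. Transfer the data from host (CPU) to device (GPU)
    // 4. Declare and launch the "kernel" routine executed on the device (GPU)
    // 5. Transfer the results from device (GPU) back to host (CPU)
    // 6. Free the memory space on the device (GPU)
    // 7. Free the memory space on the host (CPU)
    return 0;
}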
1. Allocating memory space in “device” (GPU) for data
Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
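Use the CUDA routine cudaMalloc. A sketch of the calls, using the size variable and the dev_a, dev_b, dev_c pointer names that appear in the complete program later in these notes:

int *dev_a, *dev_b, *dev_c;          // device (GPU) pointers
int size = N * sizeof(int);          // number of bytes for N ints

cudaMalloc((void**)&dev_a, size);    // allocate space on the device
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, size);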
2. Allocating memory space in “host” (CPU) for data

Use regular C malloc routines:

int *a, *b, *c;
…
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);
3. Transferring data from host (CPU) to device (GPU)

Use the CUDA routine cudaMemcpy. The first argument is the destination and the second is the source,

where:
devA and devB are pointers to the destination in the device, and
A and B are pointers to the host data.
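A sketch of the calls, assuming the devA/A names above and the size variable from step 2:

cudaMemcpy(devA, A, size, cudaMemcpyHostToDevice);   // destination first, then source
cudaMemcpy(devB, B, size, cudaMemcpyHostToDevice);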
4. Declaring “kernel” routine to execute on device (GPU)

CUDA introduces a syntax addition to C: triple angle brackets mark a call from host code to device code, and contain the organization and number of threads in two parameters:

myKernel<<< n, m >>>(arg1, … );

For now, we will set n = 1, which says one block, and m = N, which says N threads in this block.
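A sketch, using the vecAdd kernel defined later in these notes and the device pointers from the steps above:

__global__ void vecAdd(int *A, int *B, int *C);   // __global__ marks a routine that runs on
                                                  // the device but is called from the host

vecAdd<<< 1, N >>>(devA, devB, devC);             // n = 1 block, m = N threads per block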
5. Transferring data from device (GPU) to host (CPU)

Again use the CUDA routine cudaMemcpy, now copying from device to host,

where:
devC is a pointer in the device and C is a pointer in the host.
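A sketch of the call, assuming the devC/C names above:

cudaMemcpy(C, devC, size, cudaMemcpyDeviceToHost);   // destination C (host), source devC (device)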
6. Free memory space in “device” (GPU)

cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
7. Free memory space in host (CPU)
(if CPU memory allocated with malloc)

free(a);
free(b);
free(c);
Complete CUDA program

#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
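A minimal sketch of the main() routine for this program, assembling steps 1 to 7 above (the devA/devB/devC pointer names are illustrative):

int main(int argc, char *argv[]) {
    int size = N * sizeof(int);
    int *a, *b, *c;                      // host (CPU) copies
    int *devA, *devB, *devC;             // device (GPU) copies

    cudaMalloc((void**)&devA, size);     // 1. allocate device memory
    cudaMalloc((void**)&devB, size);
    cudaMalloc((void**)&devC, size);

    a = (int*)malloc(size);              // 2. allocate host memory
    b = (int*)malloc(size);
    c = (int*)malloc(size);
    // ... load a[] and b[] with input values here ...

    cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice);   // 3. host -> device
    cudaMemcpy(devB, b, size, cudaMemcpyHostToDevice);

    vecAdd<<< 1, N >>>(devA, devB, devC);                // 4. one block of N threads

    cudaMemcpy(c, devC, size, cudaMemcpyDeviceToHost);   // 5. device -> host

    cudaFree(devA); cudaFree(devB); cudaFree(devC);      // 6. free device memory
    free(a); free(b); free(c);                           // 7. free host memory

    return 0;
}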
Complete CUDA program, with keyboard input for blocks/threads
(without timing execution, see later)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>

#define N 4096                                // size of array

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];    // guard: T*B may exceed N
}

int main(int argc, char *argv[]) {
    int T = 10, B = 1;                        // threads per block/blocks per grid
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    printf("Size of array = %d\n", N);
    do {
        printf("Enter number of threads per block: ");
        scanf("%d", &T);
        printf("\nEnter number of blocks per grid: ");
        scanf("%d", &B);
        if (T * B < N) printf("Error T x B < N, try again");
    } while (T * B < N);

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) {             // load arrays with some numbers
        a[i] = i;
        b[i] = i*1;
    }

    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);

    add<<<B,T>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);   // copy results back

    cudaFree(dev_a);                          // clean up
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}
Compiling CUDA programs: “nvcc”

NVIDIA provides nvcc -- the NVIDIA CUDA “compiler driver”.

A CUDA source file that includes device code has the extension .cu.

nvcc separates the code for the CPU from the code for the GPU and compiles both. A regular C compiler must be installed for the CPU.

A makefile is convenient -- see next.

See “The CUDA Compiler Driver NVCC” from NVIDIA for more details.
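A direct invocation on the command line, assuming the program above is saved as prog1.cu, might look like:

nvcc -o prog1 prog1.cu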
Very simple sample makefile
NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda
NVCCFLAGS = -I$(CUDAPATH)/include
LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm
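A sketch of the build rule itself, using the variables above (the prog1 target and source file names are assumptions):

prog1: prog1.cu
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o prog1 prog1.cu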
The program is then run with ./prog1. The resulting file includes all the code for the host and for the device in a single “fat binary” file.