CUDAProgModel

This document introduces the CUDA programming model, focusing on GPU programming, the concept of CUDA kernels, and a simple example of adding two vectors. It explains the Single Instruction Multiple Threads (SIMT) model used in GPUs for high parallel performance, and outlines the basic structure of a CUDA program including memory allocation, data transfer, and kernel execution. Additionally, it covers the compilation process using the NVIDIA CUDA compiler (nvcc) and provides examples of CUDA code and memory management.


CUDA Programming Model

These notes will introduce:

Basic GPU programming model
CUDA kernel
Simple CUDA program to add two vectors together
Compiling the code on a Linux system

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 20, 2011    1


Programming Model

GPUs were historically designed for creating image data for displays.

That application involves manipulating image pixels (picture elements), often applying the same operation to each pixel.

SIMD (single instruction multiple data) model - an efficient mode of operation in which the same operation is done on each data element at the same time.
2
SIMD (Single Instruction Multiple Data) model

Also known as data parallel computation.
One instruction specifies the operation:

a[] = a[] + k

[Figure: a single instruction broadcast to an array of ALUs, each operating on one element a[0], a[1], ..., a[n-2], a[n-1]]

Very efficient if this is what you want to do. One program.

Can design computers to operate this way.
3
Single Instruction Multiple Thread Programming Model

A version of SIMD used in GPUs.

GPUs use a thread model to achieve very high parallel performance and to hide memory latency.

Multiple threads, each executing the same instruction sequence.

On a GPU, a very large number of threads (10,000's) is possible.

Threads are mapped onto the available processors on the GPU (100's of processors, all executing the same program sequence).
4
Programming applications using SIMT model

Matrix operations -- very amenable to SIMT
Same operations done on different elements of matrices

Some “embarrassingly” parallel computations such as Monte Carlo calculations
Monte Carlo calculations use random selections
Random selections are independent of each other

Data manipulations
Some sorting can be done quite efficiently
5
CUDA kernel routine

To write a SIMT program, one needs to write a code sequence that all the threads on the GPU will execute.

In CUDA, this code sequence is called a kernel routine.

Kernel code will be regular C except that one typically needs to use the thread ID in expressions to ensure that each thread accesses different data:

Example (all threads do this):

index = ThreadID;
A[index] = B[index] + C[index];
6
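As a concrete illustration, the pseudocode above corresponds to a CUDA kernel along the following lines (a minimal sketch; the kernel name add_elements is made up here, and threadIdx.x, introduced on a later slide, supplies the thread ID within the block):

__global__ void add_elements(int *A, int *B, int *C) {
    int index = threadIdx.x;           // built-in thread ID within the block
    A[index] = B[index] + C[index];    // each thread updates a different element
}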
CPU and GPU memory

A program, once compiled, has code executed on the CPU and (kernel) code executed on the GPU.

There are separate memories on the CPU and GPU.

Need to:
Explicitly transfer data from CPU to GPU for GPU computation, and
Explicitly transfer results in GPU memory back to CPU memory.

[Figure: CPU with CPU main memory and GPU with GPU global memory, with data copied from CPU to GPU and results copied from GPU to CPU]
7
Basic CUDA program structure

int main (int argc, char **argv ) {

1. Allocate memory space in device (GPU) for data

2. Allocate memory space in host (CPU) for data

3. Copy data to GPU

4. Call “kernel” routine to execute on GPU
(with CUDA syntax that defines the number of threads and their physical structure)

5. Transfer results from GPU to CPU

6. Free memory space in device (GPU)

7. Free memory space in host (CPU)

return 0;
}

(A code sketch of these steps follows this slide; the individual steps are detailed on the next slides.)
8
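As an overview before the step-by-step slides, a minimal sketch of how the seven steps map onto CUDA calls, assuming the vector-addition kernel vecAdd and the size N defined on the later slides (error checking omitted):

int main (int argc, char **argv) {
    int size = N * sizeof(int);
    int *a, *b, *c;                                      // host pointers
    int *devA, *devB, *devC;                             // device pointers

    cudaMalloc( (void**)&devA, size );                   // 1. allocate device (GPU) memory
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );

    a = (int*)malloc(size);                              // 2. allocate host (CPU) memory
    b = (int*)malloc(size);
    c = (int*)malloc(size);

    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);  // 3. copy data to GPU
    cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(devA, devB, devC);                  // 4. call kernel routine

    cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);  // 5. transfer results back to CPU

    cudaFree( devA );                                    // 6. free device memory
    cudaFree( devB );
    cudaFree( devC );

    free(a); free(b); free(c);                           // 7. free host memory

    return 0;
}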
1. Allocating memory space in
“device” (GPU) for data

Use CUDA malloc routines:

int size = N *sizeof( int); // space for N integers

int *devA, *devB, *devC; // devA, devB, devC ptrs

cudaMalloc( (void**)&devA, size );


cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );

9
Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
2. Allocating memory space in
“host” (CPU) for data
Use regular C malloc routines:
int *a, *b, *c;

a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);

or statically declare variables:


#define N 256

int a[N], b[N], c[N];

10
3. Transferring data from host (CPU) to device (GPU)

Use CUDA routine cudaMemcpy (arguments: destination, source, size, direction):

cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);

where:
devA and devB are pointers to the destinations in the device
A and B are pointers to the host data
11
4. Declaring “kernel” routine to execute on device (GPU)

CUDA introduces a syntax addition to C:
Triple angle brackets mark a call from host code to device code.
They contain the organization and number of threads in two parameters:

myKernel<<< n, m >>>(arg1, … );

n and m will define the organization of thread blocks and threads in a block.

For now, we will set n = 1, which says one block, and m = N, which says N threads in this block.
(A sketch of how n and m are typically chosen for larger N follows this slide.)

arg1, … -- arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.
12
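When N is larger than the number of threads one block can hold, n and m are typically chosen so that n x m >= N, with the kernel guarding against the extra threads (as in the program on slide 18). A small sketch of a common way to compute the launch configuration (variable names are illustrative):

int threadsPerBlock = 256;                                 // m
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // n, rounded up so blocks * threadsPerBlock >= N

myKernel<<< blocks, threadsPerBlock >>>(devA, devB, devC);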
Declaring a Kernel Routine

A kernel is defined using the CUDA specifier __global__ (two underscores each side).

Example – Adding two vectors A and B:

#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {  // Kernel definition
    int i = threadIdx.x;        // CUDA structure that provides thread ID in block
    C[i] = A[i] + B[i];
}

int main() {
    // allocate device memory &
    // copy data to device
    // device mem. ptrs devA, devB, devC

    vecAdd<<<1, N>>>(devA, devB, devC);  // Grid of one block, N threads in block
    ...
}

Each of the N threads performs one pair-wise addition:
Thread 0:   devC[0] = devA[0] + devB[0];
Thread 1:   devC[1] = devA[1] + devB[1];
...
Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];

13
Loosely derived from CUDA C Programming Guide, v3.2, 2010, NVIDIA
5. Transferring data from device (GPU) to host (CPU)

Use CUDA routine cudaMemcpy (arguments: destination, source, size, direction):

cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);

where:
devC is a pointer in the device and C is a pointer in the host.
14
6. Free memory space in “device”
(GPU)

Use CUDA cudaFree routine:

cudaFree( devA );
cudaFree( devB );
cudaFree( devC );

15
7. Free memory space in host (CPU)
(if CPU memory allocated with malloc)

Use the regular C free routine to deallocate memory if previously allocated with malloc:

free( a );
free( b );
free( c );

16
Complete CUDA program

Adding two vectors, A and B.
N elements in A and B, and N threads.
(Without code to load the arrays with data.)

#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main (int argc, char **argv ) {

    int size = N * sizeof(int);
    int a[N], b[N], c[N], *devA, *devB, *devC;

    cudaMalloc( (void**)&devA, size );
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );

    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(devA, devB, devC);

    cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);

    cudaFree( devA );
    cudaFree( devB );
    cudaFree( devC );

    return 0;
}
17
Complete, with keyboard input for blocks/threads
(without timing execution, see later)

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#include <time.h>

#define N 4096                        // size of array

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

int main(int argc, char *argv[]) {
    int T = 10, B = 1;                // threads per block/blocks per grid
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    printf("Size of array = %d\n", N);
    do {
        printf("Enter number of threads per block: ");
        scanf("%d",&T);
        printf("\nEnter number of blocks per grid: ");
        scanf("%d",&B);
        if (T * B < N) printf("Error T x B < N, try again");
    } while (T * B < N);

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for(int i = 0; i < N; i++) {      // load arrays with some numbers
        a[i] = i;
        b[i] = i*1;
    }

    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);

    add<<<B,T>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

    for(int i = 0; i < N; i++) {
        printf("%d+%d=%d\n", a[i], b[i], c[i]);
    }

    cudaFree(dev_a);                  // clean up
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}
18
Compiling CUDA programs - “nvcc”

NVIDIA provides nvcc -- the NVIDIA CUDA “compiler driver”.

nvcc will separate out the code for the host and for the device.

A regular C/C++ compiler is used for the host (it needs to be available).

The programmer simply uses nvcc instead of the gcc/cc compiler on a Linux system.

Command-line options include options for GPU features.
19
Compiling code - Linux

Command line:

nvcc -O3 -o <exe> <source_file> -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcuda -lcudart

where:
-O3 sets the optimization level, if you want optimized code
-I... gives the directories for #include files
-L... gives the directories for libraries
-lcuda -lcudart are the libraries to be linked

A CUDA source file that includes device code has the extension .cu.
nvcc separates the code for the CPU and for the GPU and compiles the code.
Need a regular C compiler installed for the CPU.
A makefile is convenient (see next slide).

See “The CUDA Compiler Driver NVCC” from NVIDIA for more details    20
Very simple sample makefile

NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda

NVCCFLAGS = -I$(CUDAPATH)/include
LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm

# A regular C program
prog1:
	cc -o prog1 prog1.c -lm

# A C program with X11 graphics
prog2:
	cc -I/usr/openwin/include -o prog2 prog2.c -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm

# A CUDA program
prog3:
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o prog3 prog3.cu

# A CUDA program with X11 graphics
prog4:
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -I/usr/openwin/include -o prog4 prog4.cu -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm

21
Compilation process

The nvcc “wrapper” divides the code into host and device parts.

The host part is compiled by a regular C compiler (gcc).
The device part is compiled by the NVIDIA “ptxas” assembler.

The two compiled parts are combined into one executable.

[Figure: nvcc -o prog prog.cu -I/includepath -L/libpath; nvcc invokes ptxas for the device code and gcc for the host code, and the object files are combined into one executable file, a “fat binary” containing both host and device code]

22
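To look at the pieces nvcc produces, it can be asked to emit or keep intermediate files. A sketch (flag behaviour can vary between CUDA versions, so check nvcc --help on your installation):

nvcc -ptx prog.cu              # emit the device code as PTX assembly (prog.ptx)
nvcc --keep -o prog prog.cu    # build normally but keep the intermediate host/device files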
Executing Program

Simply type the name of the executable created by nvcc:

./prog1

The file includes all the code for the host and for the device in a “fat binary” file.

The host code starts running.

When the first device kernel is encountered, the GPU code is physically sent to the GPU and the function is launched on the GPU.
Hence the first launch will be slow!!

The run-time environment (cudart) controls memcpy timing and synchronization.
23
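Because a kernel launch returns to the host immediately, it is sometimes useful when debugging or timing to synchronize and check for errors explicitly. A minimal sketch (not from the original slides) using standard CUDA runtime calls; cudaDeviceSynchronize was called cudaThreadSynchronize in older toolkits:

vecAdd<<<1, N>>>(devA, devB, devC);       // launch is asynchronous with respect to the host

cudaError_t err = cudaGetLastError();     // reports launch-configuration errors
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));

cudaDeviceSynchronize();                  // wait for the kernel to finish
                                          // (a following cudaMemcpy also synchronizes implicitly)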
Questions
