CUDAProgModel

This document introduces the CUDA programming model, focusing on GPU programming, the concept of CUDA kernels, and a simple example of adding two vectors. It explains the Single Instruction Multiple Threads (SIMT) model used in GPUs for high parallel performance, and outlines the basic structure of a CUDA program including memory allocation, data transfer, and kernel execution. Additionally, it covers the compilation process using the NVIDIA CUDA compiler (nvcc) and provides examples of CUDA code and memory management.


CUDA Programming Model

These notes will introduce:

Basic GPU programming model
CUDA kernel
Simple CUDA program to add two vectors together
Compiling the code on a Linux system

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 20, 2011    1


Programming Model

GPUs were historically designed for creating image data for displays.

That application involves manipulating image pixels (picture elements), often applying the same operation to each pixel.

SIMD (single instruction multiple data) model - an efficient mode of operation in which the same operation is done on each data element at the same time.
2
SIMD (Single Instruction Multiple Data) model

Also known as data parallel computation.
One instruction specifies the operation:

a[] = a[] + k

[Figure: a single instruction broadcast to an array of ALUs, each operating on one element a[0], a[1], ..., a[n-2], a[n-1]]

Very efficient if this is what you want to do. One program.

Can design computers to operate this way.
3
Single Instruction Multiple Thread Programming Model

A version of SIMD used in GPUs.

GPUs use a thread model to achieve very high parallel performance and to hide memory latency.

Multiple threads, each executing the same instruction sequence.

On a GPU, a very large number of threads (10,000's) is possible.

Threads are mapped onto the available processors on the GPU (100's of processors, all executing the same program sequence).
4
Programming applications using SIMT model

Matrix operations -- very amenable to SIMT
Same operations done on different elements of matrices

Some “embarrassingly” parallel computations such as Monte Carlo calculations
Monte Carlo calculations use random selections
Random selections are independent of each other

Data manipulations
Some sorting can be done quite efficiently
5
CUDA kernel routine

To write a SIMT program, one needs to write a code sequence that all the threads on the GPU will execute.

In CUDA, this code sequence is called a kernel routine.

Kernel code will be regular C except that one typically needs to use the thread ID in expressions to ensure that each thread accesses different data:

Example (all threads do this):

index = ThreadID;
A[index] = B[index] + C[index];
6
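As a concrete illustration, the pseudocode above corresponds to a CUDA kernel along the following lines (a minimal sketch; the kernel name add_elements is made up here, and threadIdx.x, introduced on a later slide, supplies the thread ID within the block):

__global__ void add_elements(int *A, int *B, int *C) {
    int index = threadIdx.x;           // built-in thread ID within the block
    A[index] = B[index] + C[index];    // each thread updates a different element
}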
CPU and GPU memory

A program, once compiled, has code executed on the CPU and (kernel) code executed on the GPU.

There are separate memories on the CPU and GPU.

Need to:
Explicitly transfer data from CPU to GPU for GPU computation, and
Explicitly transfer results in GPU memory back to CPU memory.

[Figure: CPU with CPU main memory and GPU with GPU global memory, with data copied from CPU to GPU and results copied from GPU to CPU]
7
Basic CUDA program structure

int main (int argc, char **argv ) {

1. Allocate memory space in device (GPU) for data

2. Allocate memory space in host (CPU) for data

3. Copy data to GPU

4. Call “kernel” routine to execute on GPU
(with CUDA syntax that defines the number of threads and their physical structure)

5. Transfer results from GPU to CPU

6. Free memory space in device (GPU)

7. Free memory space in host (CPU)

return 0;
}

(A code sketch of these steps follows this slide; the individual steps are detailed on the next slides.)
8
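As an overview before the step-by-step slides, a minimal sketch of how the seven steps map onto CUDA calls, assuming the vector-addition kernel vecAdd and the size N defined on the later slides (error checking omitted):

int main (int argc, char **argv) {
    int size = N * sizeof(int);
    int *a, *b, *c;                                      // host pointers
    int *devA, *devB, *devC;                             // device pointers

    cudaMalloc( (void**)&devA, size );                   // 1. allocate device (GPU) memory
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );

    a = (int*)malloc(size);                              // 2. allocate host (CPU) memory
    b = (int*)malloc(size);
    c = (int*)malloc(size);

    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);  // 3. copy data to GPU
    cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(devA, devB, devC);                  // 4. call kernel routine

    cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);  // 5. transfer results back to CPU

    cudaFree( devA );                                    // 6. free device memory
    cudaFree( devB );
    cudaFree( devC );

    free(a); free(b); free(c);                           // 7. free host memory

    return 0;
}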
1. Allocating memory space in
“device” (GPU) for data

Use CUDA malloc routines:

int size = N *sizeof( int); // space for N integers

int *devA, *devB, *devC; // devA, devB, devC ptrs

cudaMalloc( (void**)&devA, size );


cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );

9
Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
2. Allocating memory space in
“host” (CPU) for data
Use regular C malloc routines:
int *a, *b, *c;

a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);

or statically declare variables:


#define N 256

int a[N], b[N], c[N];

10
3. Transferring data from host (CPU) to device (GPU)

Use CUDA routine cudaMemcpy (arguments: destination, source, size, direction):

cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);

where:
devA and devB are pointers to the destinations in the device
A and B are pointers to the host data
11
4. Declaring “kernel” routine to execute on device (GPU)

CUDA introduces a syntax addition to C:
Triple angle brackets mark a call from host code to device code.
They contain the organization and number of threads in two parameters:

myKernel<<< n, m >>>(arg1, … );

n and m will define the organization of thread blocks and threads in a block.

For now, we will set n = 1, which says one block, and m = N, which says N threads in this block.
(A sketch of how n and m are typically chosen for larger N follows this slide.)

arg1, … -- arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.
12
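When N is larger than the number of threads one block can hold, n and m are typically chosen so that n x m >= N, with the kernel guarding against the extra threads (as in the program on slide 18). A small sketch of a common way to compute the launch configuration (variable names are illustrative):

int threadsPerBlock = 256;                                 // m
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // n, rounded up so blocks * threadsPerBlock >= N

myKernel<<< blocks, threadsPerBlock >>>(devA, devB, devC);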
Declaring a Kernel Routine

A kernel is defined using the CUDA specifier __global__ (two underscores each side).

Example – Adding two vectors A and B:

#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {  // Kernel definition
    int i = threadIdx.x;        // CUDA structure that provides thread ID in block
    C[i] = A[i] + B[i];
}

int main() {
    // allocate device memory &
    // copy data to device
    // device mem. ptrs devA, devB, devC

    vecAdd<<<1, N>>>(devA, devB, devC);  // Grid of one block, N threads in block
    ...
}

Each of the N threads performs one pair-wise addition:
Thread 0:   devC[0] = devA[0] + devB[0];
Thread 1:   devC[1] = devA[1] + devB[1];
...
Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];

13
Loosely derived from CUDA C Programming Guide, v3.2, 2010, NVIDIA
5. Transferring data from device (GPU) to host (CPU)

Use CUDA routine cudaMemcpy (arguments: destination, source, size, direction):

cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);

where:
devC is a pointer in the device and C is a pointer in the host.
14
6. Free memory space in “device”
(GPU)

Use CUDA cudaFree routine:

cudaFree( devA );
cudaFree( devB );
cudaFree( devC );

15
7. Free memory space in host (CPU)
(if CPU memory allocated with malloc)

Use the regular C free routine to deallocate memory if previously allocated with malloc:

free( a );
free( b );
free( c );

16
Complete CUDA program

Adding two vectors, A and B.
N elements in A and B, and N threads.
(Without code to load the arrays with data.)

#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main (int argc, char **argv ) {

    int size = N * sizeof(int);
    int a[N], b[N], c[N], *devA, *devB, *devC;

    cudaMalloc( (void**)&devA, size );
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );

    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(devA, devB, devC);

    cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);

    cudaFree( devA );
    cudaFree( devB );
    cudaFree( devC );

    return 0;
}
17
Complete, with keyboard input for blocks/threads
(without timing execution, see later)

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#include <time.h>

#define N 4096                        // size of array

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

int main(int argc, char *argv[]) {
    int T = 10, B = 1;                // threads per block/blocks per grid
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    printf("Size of array = %d\n", N);
    do {
        printf("Enter number of threads per block: ");
        scanf("%d",&T);
        printf("\nEnter number of blocks per grid: ");
        scanf("%d",&B);
        if (T * B < N) printf("Error T x B < N, try again");
    } while (T * B < N);

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for(int i = 0; i < N; i++) {      // load arrays with some numbers
        a[i] = i;
        b[i] = i*1;
    }

    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);

    add<<<B,T>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

    for(int i = 0; i < N; i++) {
        printf("%d+%d=%d\n", a[i], b[i], c[i]);
    }

    cudaFree(dev_a);                  // clean up
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}
18
Compiling CUDA programs - “nvcc”

NVIDIA provides nvcc -- the NVIDIA CUDA “compiler driver”.

nvcc will separate out the code for the host and for the device.

A regular C/C++ compiler is used for the host (it needs to be available).

The programmer simply uses nvcc instead of the gcc/cc compiler on a Linux system.

Command-line options include options for GPU features.
19
Compiling code - Linux

Command line:

nvcc -O3 -o <exe> <source_file> -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcuda -lcudart

where:
-O3 sets the optimization level, if you want optimized code
-I... gives the directories for #include files
-L... gives the directories for libraries
-lcuda -lcudart are the libraries to be linked

A CUDA source file that includes device code has the extension .cu.
nvcc separates the code for the CPU and for the GPU and compiles the code.
Need a regular C compiler installed for the CPU.
A makefile is convenient (see next slide).

See “The CUDA Compiler Driver NVCC” from NVIDIA for more details    20
Very simple sample makefile

NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda

NVCCFLAGS = -I$(CUDAPATH)/include
LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm

# A regular C program
prog1:
	cc -o prog1 prog1.c -lm

# A C program with X11 graphics
prog2:
	cc -I/usr/openwin/include -o prog2 prog2.c -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm

# A CUDA program
prog3:
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o prog3 prog3.cu

# A CUDA program with X11 graphics
prog4:
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -I/usr/openwin/include -o prog4 prog4.cu -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm

21
Compilation process

The nvcc “wrapper” divides the code into host and device parts.

The host part is compiled by a regular C compiler (gcc).
The device part is compiled by the NVIDIA “ptxas” assembler.

The two compiled parts are combined into one executable.

[Figure: nvcc -o prog prog.cu -I/includepath -L/libpath; nvcc invokes ptxas for the device code and gcc for the host code, and the object files are combined into one executable file, a “fat binary” containing both host and device code]

22
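To look at the pieces nvcc produces, it can be asked to emit or keep intermediate files. A sketch (flag behaviour can vary between CUDA versions, so check nvcc --help on your installation):

nvcc -ptx prog.cu              # emit the device code as PTX assembly (prog.ptx)
nvcc --keep -o prog prog.cu    # build normally but keep the intermediate host/device files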
Executing Program

Simply type the name of the executable created by nvcc:

./prog1

The file includes all the code for the host and for the device in a “fat binary” file.

The host code starts running.

When the first device kernel is encountered, the GPU code is physically sent to the GPU and the function is launched on the GPU.
Hence the first launch will be slow!!

The run-time environment (cudart) controls memcpy timing and synchronization.
23
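Because a kernel launch returns to the host immediately, it is sometimes useful when debugging or timing to synchronize and check for errors explicitly. A minimal sketch (not from the original slides) using standard CUDA runtime calls; cudaDeviceSynchronize was called cudaThreadSynchronize in older toolkits:

vecAdd<<<1, N>>>(devA, devB, devC);       // launch is asynchronous with respect to the host

cudaError_t err = cudaGetLastError();     // reports launch-configuration errors
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));

cudaDeviceSynchronize();                  // wait for the kernel to finish
                                          // (a following cudaMemcpy also synchronizes implicitly)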
Questions
