CUDA Introduction
Ø What is GPGPU?
l General-Purpose computation on a Graphics
Processing Unit
l Using graphics hardware for non-graphics
computations
Ø What is CUDA?
l Parallel computing platform and API by Nvidia
for general-purpose GPU programming
l Introduced in 2007; still actively updated
Motivation
CPU vs. GPU
Ø CPU
l Fast caches
l Branching adaptability
l High single-thread performance
Ø GPU
l Multiple ALUs
l Fast onboard memory
l High throughput on parallel tasks
• Executes program on each fragment/vertex
CPU vs. GPU - Hardware
Traditional Graphics Pipeline
Vertex processing
↓
Rasterizer
↓
Fragment processing
↓
Renderer (textures)
Pixel / Thread Processing
GPU Architecture
Processing Element
GPU Memory Architecture
Uncached:
Ø Registers
Ø Shared Memory
Ø Local Memory
Ø Global Memory
Cached:
Ø Constant Memory
Ø Texture Memory
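As an illustration (not from the deck), here is where these spaces appear in CUDA source; local memory is compiler-managed spill space and texture memory is accessed through a separate API, so neither is declared explicitly, and all names here are hypothetical:

/* Constant memory: cached, read-only from kernel code */
__constant__ float coeffs[16];

/* Assumes a launch with 128 threads per block */
__global__ void memorySpaces(const float *global_in, float *global_out)
{
    /* Registers: ordinary per-thread scalars normally live here */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    /* Shared memory: fast on-chip storage shared by the block */
    __shared__ float tile[128];

    tile[threadIdx.x] = global_in[i];   /* global memory read */
    __syncthreads();                    /* wait for the whole block */
    global_out[i] = tile[threadIdx.x] * coeffs[0];   /* global memory write */
}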
Data-parallel Programming
Ø Think of the GPU as a massively-threaded
co-processor
Ø Write “kernel” functions that execute on
the device, processing multiple data
elements in parallel (see the sketch below)
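A preview sketch (not part of the original slide): a kernel is an ordinary C function marked __global__, and each of the many threads running it handles one data element. The function and names are illustrative:

/* Each thread doubles one element; the index comes from built-in
   thread/block coordinates introduced later in this deck */
__global__ void doubleElements(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 2.0f * data[i];
}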
Hardware Requirements
Ø CUDA-capable
video card
Ø Power supply
Ø Cooling
Ø PCI-Express
A Gentle Introduction to
CUDA Programming
Credits
Ø The code used in this presentation is based
on code available in:
l the Tutorial on CUDA in Dr. Dobb's Journal
l Andrew Bellenir’s code for matrix multiplication
l Igor Majdandzic’s code for Voronoi diagrams
l NVIDIA’s CUDA programming guide
Software Requirements/Tools
Ø CUDA Toolkit (includes the nvcc compiler)
Ø CUDA-capable Nvidia driver
Profiling:
Ø Occupancy calculator
Ø Visual profiler
To compute, we need to:
Ø Allocate memory for the computation
on the GPU (incl. variables)
Ø Provide input data
Ø Specify the computation to be performed
Ø Read the results from the GPU (output)
Initially:
[figure: array in host memory only]
Allocate Memory in the GPU card
[figure: array on the host; array_d allocated on the GPU]
Copy content from the host's memory to the GPU card memory
[figure: array copied into array_d]
Execute code on the GPU
[figure: the GPU MPs operate on array_d]
Copy results back to the host memory
[figure: array_d copied back into array]
The Kernel
Ø The code to be executed in the
stream processors on the GPU
Ø Simultaneous execution in
several (perhaps all) stream
processors on the GPU
Grid and Block Size
Let’s look at a very simple example
Ø The code has been divided into two files:
l simple.c
l simple.cu
Ø simple.c is ordinary code in C
Ø It allocates an array of integers, initializes
it to zeros, and prints the array.
Ø It calls a function that modifies the array
Ø The array is printed again.
simple.c
#include <stdio.h>
#define SIZEOFARRAY 64
extern void fillArray(int *a, int size);
/* The main program */
int main(int argc, char *argv[])
{
    /* Declare the array that will be modified by the GPU */
    int a[SIZEOFARRAY];
    int i;
    /* Initialize the array to 0s */
    for (i = 0; i < SIZEOFARRAY; i++) {
        a[i] = 0;
    }
    /* Print the initial array */
    printf("Initial state of the array:\n");
    for (i = 0; i < SIZEOFARRAY; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
    /* Call the function that will in turn call the function in the GPU that will fill the array */
    fillArray(a, SIZEOFARRAY);
    /* Now print the array after calling fillArray */
    printf("Final state of the array:\n");
    for (i = 0; i < SIZEOFARRAY; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
    return 0;
}
simple.cu
Ø simple.cu contains two functions
l fillArray(): A function that will be executed on
the host and which takes care of:
• Allocating variables in the global GPU memory
• Copying the array from the host to the GPU memory
• Setting the grid and block sizes
• Invoking the kernel that is executed on the GPU
• Copying the values back to the host memory
• Freeing the GPU memory
fillArray (part 1)
#define BLOCK_SIZE 32
extern "C" void fillArray(int *array, int arraySize)
{
    int *array_d;
    cudaError_t result;
fillArray (part 2)
    /* Indicate block size */
    dim3 dimblock(BLOCK_SIZE);
    /* Indicate grid size */
    dim3 dimgrid(arraySize / BLOCK_SIZE);
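The deck stops the listing here; the remaining steps it enumerates for fillArray (allocate, copy in, launch, copy out, free) would look roughly like this sketch, following the Dr. Dobb's tutorial pattern cited in the credits:

    /* Allocate memory on the GPU for the array */
    result = cudaMalloc((void **)&array_d, sizeof(int) * arraySize);
    /* Copy the array from host memory to GPU memory */
    result = cudaMemcpy(array_d, array, sizeof(int) * arraySize,
                        cudaMemcpyHostToDevice);
    /* Invoke the kernel (shown on the next slide) on the GPU */
    cu_fillArray<<<dimgrid, dimblock>>>(array_d);
    /* Copy the values back to host memory */
    result = cudaMemcpy(array, array_d, sizeof(int) * arraySize,
                        cudaMemcpyDeviceToHost);
    /* Free the GPU memory */
    cudaFree(array_d);
}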
simple.cu (cont.)
Ø The other function in simple.cu is cu_fillArray():
l Built-in variables:
• blockIdx.x : block index within the grid
• threadIdx.x: thread index within the block
cu_fillArray
__global__ void cu_fillArray(int * array_d)
{
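    /* Compute this thread's global index, then store that index
       into the corresponding array element */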
    int x;
    x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    array_d[x] = x;
}
To compile:
Ø nvcc simple.c simple.cu -o simple
Ø The compiler generates the code for both
the host and the GPU
Ø Demo on cuda.littlefe.net …
In the GPU:
[figure: array elements mapped onto processing elements, grouped into Block 0 and Block 1]
Another Example: saxpy
Ø SAXPY (Scalar Alpha X Plus Y)
l A common operation in linear algebra
Ø CUDA: loop iteration → thread
Traditional Sequential Code
void saxpy_serial(int n, float alpha, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
CUDA Code
__global__ void saxpy_parallel(int n, float alpha, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}
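The deck does not show the launch; a minimal sketch, where x_d and y_d are hypothetical device pointers already filled via cudaMemcpy:

/* Round the block count up so all n elements are covered; the
   if (i < n) guard in the kernel discards the excess threads */
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x_d, y_d);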
“Warps”
Ø Each block is split into SIMD groups of threads
called "warps"; a warp is 32 threads on Nvidia GPUs.
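A sketch (not from the deck) of why warps matter: threads in a warp execute in lockstep, so a branch that splits a warp is serialized, while a branch that is uniform across each warp is not. The kernel names are hypothetical:

__global__ void divergent(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)          /* even/odd threads share a warp: divergent */
        out[i] = 2 * i;
    else
        out[i] = -i;
}

__global__ void uniform(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)   /* whole warps take the same branch */
        out[i] = 2 * i;
    else
        out[i] = -i;
}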
[figure: Blocks 1-4, each split into warps 1-3 of threads, assigned to Multi-processor 1]
Keeping multiprocessors in mind…
Ø Each multiprocessor can process multiple blocks at a
time.
Performance Tip: Block Size
Ø Use a multiple of the warp size (32 threads); common
choices are 128-256 threads per block
Performance Tip:
Grid Size (number of blocks)
Ø Recommended value is at least 100 blocks; around 1000 blocks
scales well across many generations of hardware
Ø Example: 24 blocks
l A 24-block grid runs efficiently on Series 8 hardware (12 MPs), but
it wastes resources on newer GPUs with 32 MPs
Example: Tesla P100
Ø Launched in 2016
Ø “Pascal” architecture (successors: Volta, Turing)
Ø Double-precision performance: 4.7 TeraFLOPS
Ø Single-precision performance: 9.3 TeraFLOPS
Ø GPU Memory: 16 GB
Example: Tesla P100
Ø Number of multiprocessors (MPs): 56
Ø Number of CUDA cores per MP: 64
Ø Total number of CUDA cores: 3584
Ø Number of CUDA cores = number of floating-point
instructions that can be issued per cycle
Ø MPs can run multiple threads per core
simultaneously (similar to hyperthreading on a CPU)
Ø Hence, the number of threads can exceed the number of cores
Memory Alignment
Ø Memory accesses are faster if data is aligned at
64-byte boundaries
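A small illustration (not from the deck): CUDA's __align__ qualifier forces a struct onto an aligned boundary so a thread can fetch it in one wide transaction; the struct is hypothetical:

/* Without __align__, this 12-byte struct could straddle alignment
   boundaries and cost extra memory transactions per thread */
struct __align__(16) float3_padded {
    float x, y, z;   /* padded to 16 bytes by the qualifier */
};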
Allocating 2D arrays with “pitch”
Ø CUDA offers special versions of the allocation and
copy functions for pitched 2D data:
l cudaMallocPitch()
l cudaMemcpy2D()
[figure: rows and columns of a 2D array, with padding at the end of each row; the pitch is the padded row width]
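A minimal sketch (not from the deck) of allocating and filling a pitched 2D array; width, height, and host_ptr are hypothetical:

float *dev_ptr;
size_t pitch;   /* bytes per padded row, chosen by CUDA */

/* Allocate height rows of width floats, each row padded for alignment */
cudaMallocPitch((void **)&dev_ptr, &pitch, width * sizeof(float), height);

/* Copy a tightly packed host array; cudaMemcpy2D handles the padding */
cudaMemcpy2D(dev_ptr, pitch, host_ptr, width * sizeof(float),
             width * sizeof(float), height, cudaMemcpyHostToDevice);

/* In a kernel, row r starts at (float *)((char *)dev_ptr + r * pitch) */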
Dividing the work by blocks:
[figure: the rows of the pitched array divided among Block 0, Block 1, and Block 2]
Watchdog timer
Ø The OS may force programs using the GPU to
time out if they run for too long
Resources online
Ø http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532
Ø http://www.ddj.com/hpc-high-performance-computing/207200659
Ø http://www.nvidia.com/object/cuda_home.html#
Ø http://www.nvidia.com/object/cuda_learn.html
Ø "Computation of Voronoi diagrams using a graphics
processing unit" by Igor Majdandzic et al., available
through the IEEE Digital Library, DOI: 10.1109/EIT.2008.4554342