
CUDA: Introduction

Christian Trefftz / Greg Wolffe


Grand Valley State University
Supercomputing 2008
Education Program
(modifications by Jernej Barbic, 2008-2019)
Terms
Ø What is GPGPU?
l General-Purpose computing on a Graphics Processing Unit
l Using graphics hardware for non-graphics computations

Ø What is CUDA?
l Parallel computing platform and API by Nvidia
l Compute Unified Device Architecture
l Software architecture for managing data-parallel programming
l Introduced in 2007; still actively updated

2
Motivation
(slides 3-5: figures only)
CPU vs. GPU
Ø CPU
l Fast caches
l Branching adaptability
l High performance
Ø GPU
l Multiple ALUs
l Fast onboard memory
l High throughput on parallel tasks
• Executes program on each fragment/vertex

Ø CPUs are great for task parallelism


Ø GPUs are great for data parallelism

6
CPU vs. GPU - Hardware

Ø More transistors devoted to data processing

7
Traditional Graphics Pipeline
Vertex processing
↓
Rasterizer
↓
Fragment processing
↓
Renderer (textures)

8
Pixel / Thread Processing

9
GPU Architecture

10
Processing Element

Ø Processing element = thread processor

11
GPU Memory Architecture
Uncached:
Ø Registers
Ø Shared Memory
Ø Local Memory
Ø Global Memory

Cached:
Ø Constant Memory
Ø Texture Memory

12
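As a rough illustration (not from the original slides), here is how these memory spaces typically show up in CUDA C source; the kernel and variable names below are invented for the example, and texture memory (accessed through the separate texture APIs) is omitted:

__constant__ float coeffs[16];      /* constant memory: read-only in kernels, cached;
                                       filled from the host with cudaMemcpyToSymbol() */

__global__ void memorySpaces(float *data_d)   /* data_d points into global memory */
{
    __shared__ float tile[64];      /* shared memory: visible to all threads of a block */
    int i = threadIdx.x;            /* simple scalars like i normally live in registers */
    float scratch[128];             /* large per-thread arrays may be spilled to local memory */

    tile[i] = data_d[i] * coeffs[i % 16];
    scratch[i % 128] = tile[i];
    data_d[i] = scratch[i % 128];
}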
Data-parallel Programming
Ø Think of the GPU as a massively-threaded co-processor
Ø Write “kernel” functions that execute on the device --
processing multiple data elements in parallel

Ø Keep it busy! → massive threading

Ø Keep your data close! → local memory

13
Hardware Requirements
Ø CUDA-capable video card
Ø Power supply
Ø Cooling
Ø PCI-Express

14
A Gentle Introduction to CUDA Programming

17
Credits
Ø The code used in this presentation is based
on code available in:
l the Tutorial on CUDA in Dr. Dobb’s Journal
l Andrew Bellenir’s code for matrix multiplication
l Igor Majdandzic’s code for Voronoi diagrams
l NVIDIA’s CUDA programming guide

18
Software Requirements/Tools

Ø CUDA device driver


Ø CUDA Toolkit (compiler, CUBLAS, CUFFT)
Ø CUDA Software Development Kit
l Emulator

Profiling:
Ø Occupancy calculator
Ø Visual profiler

19
To compute, we need to:
Ø Allocate memory for the computation
on the GPU (incl. variables)
Ø Provide input data
Ø Specify the computation to be performed
Ø Read the results from the GPU (output)

20
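Put together, the four steps above map onto a handful of CUDA runtime calls. A minimal sketch, assuming a hypothetical kernel square() and an array length n that is a multiple of the block size (error checking omitted):

#include <cuda_runtime.h>

__global__ void square(int *data_d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data_d[i] = data_d[i] * data_d[i];
}

void squareOnGPU(int *array, int n)
{
    int *array_d;
    cudaMalloc((void**)&array_d, n * sizeof(int));      /* 1. allocate GPU memory */
    cudaMemcpy(array_d, array, n * sizeof(int),
               cudaMemcpyHostToDevice);                 /* 2. provide input data */
    square<<<n / 256, 256>>>(array_d);                  /* 3. run the computation */
    cudaMemcpy(array, array_d, n * sizeof(int),
               cudaMemcpyDeviceToHost);                 /* 4. read back the results */
    cudaFree(array_d);
}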
Initially:
(diagram: “array” resides in the host/CPU memory; the GPU card’s memory is empty)

21
Allocate Memory in the GPU card
(diagram: “array” in the host’s memory; “array_d” allocated in the GPU card’s memory)

22
Copy content from the host’s memory to the GPU card memory
(diagram: the contents of “array” are copied into “array_d”)

23
Execute code on the GPU
(diagram: the GPU multiprocessors operate on “array_d”)

24
Copy results back to the host memory
(diagram: the contents of “array_d” are copied back into “array”)

25
The Kernel
Ø The code to be executed in the
stream processors on the GPU

Ø Simultaneous execution in
several (perhaps all) stream
processors on the GPU

Ø How is every instance of the kernel
going to know which piece of data it is working on?

26
Grid and Block Size

l Grid size: the number of blocks
• Can be a 1- or 2-dimensional array of blocks
l Each block is divided into threads
• Can be a 1-, 2-, or 3-dimensional array of threads

27
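A small sketch of how these sizes are specified in CUDA C; the image-processing kernel, its name, and the 16 x 16 block shape are assumptions made for this example, not part of the original slides:

/* Hypothetical kernel: each thread handles one pixel of a width x height image. */
__global__ void clearImage(float *image_d, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   /* column index of this thread */
    int row = blockIdx.y * blockDim.y + threadIdx.y;   /* row index of this thread */
    if (col < width && row < height)
        image_d[row * width + col] = 0.0f;
}

/* Host side (inside some host function): a 2-D grid of 2-D blocks. */
dim3 dimBlock(16, 16);                                 /* 16 x 16 = 256 threads per block */
dim3 dimGrid((width + 15) / 16, (height + 15) / 16);   /* enough blocks to cover the image */
clearImage<<<dimGrid, dimBlock>>>(image_d, width, height);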
Let’s look at a very simple example
Ø The code has been divided into two files:
l simple.c
l simple.cu
Ø simple.c is ordinary C code
Ø It allocates an array of integers, initializes it
to zeros, and prints the array.
Ø It calls a function that (on the GPU) fills the array
with each element’s index
Ø The array is printed again.

28
simple.c
#include <stdio.h>
#define SIZEOFARRAY 64
extern void fillArray(int *a,int size);
/* The main program */
int main(int argc,char *argv[])
{
/* Declare the array that will be modified by the GPU */
int a[SIZEOFARRAY];
int i;
/* Initialize the array to 0s */
for(i=0;i < SIZEOFARRAY;i++) {
a[i]=0;
}
/* Print the initial array */
printf("Initial state of the array:\n");
for(i = 0;i < SIZEOFARRAY;i++) {
printf("%d ",a[i]);
}
printf("\n");
/* Call the function that will in turn call the function in the GPU that will fill
the array */
fillArray(a,SIZEOFARRAY);
/* Now print the array after calling fillArray */
printf("Final state of the array:\n");
for(i = 0;i < SIZEOFARRAY;i++) {
printf("%d ",a[i]);
}
printf("\n");
return 0;
}

29
simple.cu
Ø simple.cu contains two functions
l fillArray(): A function that will be executed on
the host and which takes care of:
• Allocating variables in the global GPU memory
• Copying the array from the host to the GPU memory
• Setting the grid and block sizes
• Invoking the kernel that is executed on the GPU
• Copying the values back to the host memory
• Freeing the GPU memory

30
fillArray (part 1)
#define BLOCK_SIZE 32
extern "C" void fillArray(int *array, int arraySize)
{
int * array_d;
cudaError_t result;

/* cudaMalloc allocates space in GPU memory */
result = cudaMalloc((void**)&array_d, sizeof(int)*arraySize);

/* copy the CPU array into the GPU array_d */
result = cudaMemcpy(array_d, array, sizeof(int)*arraySize,
cudaMemcpyHostToDevice);

31
fillArray (part 2)
/* Indicate block size */
dim3 dimblock(BLOCK_SIZE);
/* Indicate grid size */
dim3 dimgrid(arraySize / BLOCK_SIZE);

/* Call the kernel */
cu_fillArray<<<dimgrid,dimblock>>>(array_d);

/* Copy the results from GPU back to CPU memory */
result = cudaMemcpy(array, array_d, sizeof(int)*arraySize,
cudaMemcpyDeviceToHost);

/* Release the GPU memory */
cudaFree(array_d);
}

32
simple.cu (cont.)
Ø The other function in simple.cu is cu_fillArray():

l This is the GPU kernel

l Identified by the keyword: __global__

l Built-in variables:
• blockIdx.x : block index within the grid
• threadIdx.x: thread index within the block

33
cu_fillArray
__global__ void cu_fillArray(int * array_d)
{
int x;
x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
array_d[x] = x;
}

__global__ void cu_addIntegers(int * array_d1, int * array_d2)
{
int x;
x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
array_d1[x] += array_d2[x];
}

34
To compile:
Ø nvcc simple.c simple.cu -o simple
Ø The compiler generates the code for both
the host and the GPU
Ø Demo on cuda.littlefe.net …

35
In the GPU:
(diagram: processing elements execute threads 0-3 of Block 0 and threads 0-3 of Block 1, each thread mapped to one array element)

37
Another Example: saxpy
Ø SAXPY (Scalar Alpha X Plus Y)
l A common operation in linear algebra
Ø CUDA: loop iteration → thread

41
Traditional Sequential Code
void saxpy_serial(int n,
float alpha,
float *x,
float *y)
{
for(int i = 0;i < n;i++)
y[i] = alpha*x[i] + y[i];
}

42
CUDA Code
__global__ void saxpy_parallel(int n,
float alpha,
float *x,
float *y) {
int i = blockIdx.x*blockDim.x+threadIdx.x;
if (i<n)
y[i] = alpha*x[i] + y[i];
}
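The slides do not show the host-side launch. A plausible sketch (the device pointers x_d and y_d are assumed to be already allocated and filled): the block count is rounded up so that at least n threads are started, and the if (i < n) test above discards the extras.

int blockSize = 256;
int nBlocks   = (n + blockSize - 1) / blockSize;   /* ceiling division */
saxpy_parallel<<<nBlocks, blockSize>>>(n, alpha, x_d, y_d);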

43
“Warps”
Ø Each block is split into SIMD groups of threads
called “warps”.

Ø Each warp contains the same number of threads,
called the “warp size”.

44
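For reference, the warp size can be queried from the device properties, and the number of warps in a block follows from the block size. A sketch, assuming device 0:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                          /* query properties of device 0 */
int warpSize      = prop.warpSize;                          /* 32 on NVIDIA GPUs */
int blockSize     = 256;
int warpsPerBlock = (blockSize + warpSize - 1) / warpSize;  /* 8 warps for a 256-thread block */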
(diagram: Blocks 1-4, each divided into warps 1-3 of threads, all scheduled on Multi-processor 1)
45
Keeping multiprocessors in mind…
Ø Each multiprocessor can process multiple blocks at a time.

Ø How many depends on the number of registers per thread
and how much shared memory per block is required by a
given kernel.

Ø If a block is too large, it will not fit into the resources of an MP.

46
Performance Tip: Block Size

Ø Critical for performance
Ø Recommended value is 192 or 256
Ø Maximum value is 512
Ø Should be a multiple of 32 since this is the warp size for
Series 8 GPUs and thus the native execution size for multiprocessors
Ø Limited by number of registers on the MP
Ø Series 8 GPU MPs have 8192 registers which are shared
between all the threads on an MP

47
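As a back-of-the-envelope illustration of the register limit (the 16-registers-per-thread figure is an assumption chosen for the example, not a measured value):

int registersPerMP     = 8192;   /* register file of a Series 8 multiprocessor */
int registersPerThread = 16;     /* depends on the kernel; assumed here */
int maxResidentThreads = registersPerMP / registersPerThread;   /* 512 threads */
int blockSize          = 256;
int blocksPerMP        = maxResidentThreads / blockSize;        /* 2 blocks can be resident at once */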
Performance Tip:
Grid Size (number of blocks)
Ø Recommended value is at least 100, but 1000 would
scale for many generations of hardware

Ø Actual value depends on problem size

Ø It should be a multiple of the number of MPs for an even
distribution of work (not a requirement though)

Ø Example: 24 blocks
l Grid will work efficiently on Series 8 (12 MPs), but it will waste
resources on new GPUs with 32 MPs

48
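One way to follow this tip is to query the multiprocessor count at run time and round the block count up to a multiple of it. A sketch; minBlocksNeeded stands for whatever block count the problem size requires:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int numMPs   = prop.multiProcessorCount;                             /* e.g., 12 on a Series 8 GPU */
int gridSize = ((minBlocksNeeded + numMPs - 1) / numMPs) * numMPs;   /* round up to a multiple of numMPs */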
Example: Tesla P100
Ø Launched in 2016
Ø “Pascal” architecture (successors: Volta, Turing)
Ø Double-precision performance: 4.7 TeraFLOPS
Ø Single-precision performance: 9.3 TeraFLOPS
Ø GPU Memory: 16 GB

49
Example: Tesla P100
Ø Number of Multiprocessors (MPs): 56
Ø Number of Cuda Cores per MP: 64
Ø Total number of Cuda Cores: 3584
Ø #Cuda Cores = number of floating-point
instructions that can be processed per cycle
Ø MPs can run multiple threads per core
simultaneously (similar to hyperthreading on a CPU)
Ø Hence, #threads can be larger than #cores

50
Memory Alignment
Ø Memory access is faster if data is aligned at 64-byte
boundaries

Ø Hence, allocate 2D arrays so that every row starts at a
64-byte boundary

Ø Tedious for a programmer

51
Allocating 2D arrays with “pitch”
Ø CUDA offers special versions of:

l Memory allocation of 2D arrays so that every row is
padded (if necessary): cudaMallocPitch()

l Memory copy operations that take into account the
pitch: cudaMemcpy2D()

52
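A sketch of how the two calls fit together; the matrix dimensions (rows, cols) and the host array matrix_h are assumptions for the example:

float *matrix_d;
size_t pitch;
/* Allocate a rows x cols float matrix; every row starts at an aligned
   address, and the actual row stride in bytes is returned in "pitch". */
cudaMallocPitch((void**)&matrix_d, &pitch, cols * sizeof(float), rows);

/* Copy a tightly packed host matrix (row stride = cols * sizeof(float))
   into the pitched device allocation, row by row. */
cudaMemcpy2D(matrix_d, pitch,
             matrix_h, cols * sizeof(float),
             cols * sizeof(float), rows,
             cudaMemcpyHostToDevice);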
Pitch
(diagram: a 2-D array with rows and columns, padding at the end of each row, and the pitch spanning the columns plus the padding)

53
Dividing the work by blocks:
(diagram: the rows of the pitched 2-D array are assigned to Block 0, Block 1, Block 2, …)
60
Watchdog timer
Ø The OS may force programs using the GPU to time out
if they run for too long

Ø Exceeding the limit can cause CUDA program failure.

Ø Possible solution: run CUDA on a GPU that is NOT
attached to a display.

65
Resources on line
Ø http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532
Ø http://www.ddj.com/hpc-high-performance-computing/207200659
Ø http://www.nvidia.com/object/cuda_home.html#
Ø http://www.nvidia.com/object/cuda_learn.html
Ø “Computation of Voronoi diagrams using a graphics
processing unit” by Igor Majdandzic et al., available
through the IEEE Digital Library, DOI: 10.1109/EIT.2008.4554342

66
