
PMPP book, ch. 1-3
CUDA-MODE
Lecture 2
Agenda for Lecture 2
• 1: Introduction
• 2: Heterogeneous data parallel computing
• 3: Multidimensional grids and data
Ch 1: Introduction
• motivation: GPU go brrr, more FLOPS please
• Why? Simulation & world-models (games, weather, proteins, robotics)
• Bigger models are smarter -> AGI (prevent wars, fix climate, cure cancer)
• GPUs are the backbone of modern deep learning
• classic software: sequential programs
• the trend of ever-higher CPU clock rates slowed around 2003: energy consumption & heat dissipation
• multi-core CPUs came up instead
• developers had to learn multi-threading (deadlocks, races, etc.)
The Power Wall
(figure: microprocessor trend data 1970-2020 - transistor counts (thousands) keep rising while clock frequency (MHz) flattens after ~2005)
Source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten (1970-2010); K. Rupp (2010-2017).
(increasing frequency further would make the chip too hot to cool feasibly)
The rise of CUDA
• CUDA is all about parallel programs (modern software)
• GPUs have (much) higher peak FLOPS than multi-core CPUs
• main principle: divide work among threads
• GPUs focus on execution throughput of massive number of threads
• programs with few threads perform poorly on GPUs
• CPU+GPU: sequential parts on CPU, numerically intensive parts on GPU
• CUDA: Compute Unified Device Architecture
• GPGPU: before CUDA, tricks were used to compute with graphics APIs (OpenGL or Direct3D)
• GPU programming is now attractive for developers (thanks to massive availability)
Amdahl's Law
• speedup = slow_sys_time / fast_sys_time
• achievable speedup is limited by the parallelizable portion p of a program
• e.g., if p is 90%, speedup < 10×
• fortunately, for many real applications p > 99%, especially for large datasets, and speedups > 100× are attainable (see the sketch below)
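A minimal sketch of the bound behind these numbers (my own illustration, not from the slides): if a fraction p of the runtime is parallelized and that part is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s), which approaches 1 / (1 - p) as s grows.

// Amdahl's law: overall speedup for parallel fraction p and parallel speedup s
#include <stdio.h>

static double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void) {
    printf("p=0.90, s=1000: %6.2fx\n", amdahl_speedup(0.90, 1000.0)); // ~9.9x, bounded by 10x
    printf("p=0.99, s=1000: %6.2fx\n", amdahl_speedup(0.99, 1000.0)); // ~91x, bound is 100x
    return 0;
}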
Challenges
• "if you do not care about performance, parallel programming is
very easy"
• designing parallel algorithms is in practice harder than designing sequential algorithms, e.g. parallelizing recurrent computations requires non-intuitive thinking (like prefix sums)
• speed is often limited by memory latency/throughput (memory bound)
• perf of parallel programs can vary dramatically based on input data characteristics
• not all apps are "embarrassingly parallel" - synchronization imposes overhead (waits)
Main Goals of the Book
1. Parallel programming & computational thinking
2. Correctness & reliability: debugging both functionality & performance
3. Scalability: regularize and localize memory access

• PMPP aims to build up the foundation for parallel programming in general
• GPUs as learning vehicle - techniques apply to other accelerators
• concepts are introduced hands-on as concrete CUDA examples
Ch 2: Heterogeneous data parallel computing
• heterogeneous: CPU + GPU
• data parallelism: break work down into computations that can be executed independently
• two examples: vector addition & a kernel to convert an RGB image to grayscale
• independence: each RGB pixel can be converted individually
• L = r*0.21 + g*0.72 + b*0.07 (L = luminance)
• simple weighted sum (see the sketch below)
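A minimal CUDA sketch of this per-pixel conversion (my own, not the book's listing; assumes an interleaved RGB buffer of unsigned chars and a 2D grid covering the image):

__global__
void rgbToGrayKernel(unsigned char* gray, const unsigned char* rgb,
                     int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int grayIdx = row * width + col; // one output value per pixel
        int rgbIdx  = 3 * grayIdx;       // 3 interleaved input channels
        unsigned char r = rgb[rgbIdx];
        unsigned char g = rgb[rgbIdx + 1];
        unsigned char b = rgb[rgbIdx + 2];
        gray[grayIdx] = (unsigned char)(0.21f * r + 0.72f * g + 0.07f * b);
    }
}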
RGB -> Grayscale: data independence (figure)
CUDA C
• extends ANSI C with minimal new syntax
• Terminology: CPU = host, GPU = device
• CUDA C source can be mixture of host & device code
• device code functions: kernels
• grid of threads: many threads are launched to execute a kernel
• CPU & GPU code runs concurrently (overlapped)
• on GPU: don't be afraid of launching many threads
• e.g. one thread per (output) tensor element is fine
Example: Vector Addition
• vector addition example:
• main concept: loop -> threads
• Easily parallelizable: all additions can be computed independently

• Naïve GPU vector addition:
  1. Allocate device memory for vectors
  2. Transfer inputs host -> device
  3. Launch kernel and perform additions
  4. Copy result device -> host
  5. Free device memory
• normally we keep data on the GPU as long as possible to asynchronously schedule many kernel launches
(figure: input vector x, input vector y, output vector z - one thread per vector element)
CUDA Essentials: Memory allocation
• NVIDIA devices come with their own DRAM, the device's global memory (in Ch 5 we learn about other memory types)
• cudaMalloc & cudaFree:
float *A_d;
size_t size = n * sizeof(float); // size in bytes
cudaMalloc((void**)&A_d, size); // pointer to pointer!
...
cudaFree(A_d);
cudaMemcpy: Host <-> Device Transfer
• copy data from CPU memory to GPU memory and vice versa

// copy input vectors to device (host -> device)
cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);
...
// transfer result back to CPU memory (device -> host)
cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
CUDA Error handling
• CUDA functions return `cudaError_t` .. if it is not `cudaSuccess` we have a problem ...
• always check the returned error status (see the sketch below)
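A common pattern (a sketch, not from the slides) is a small macro that wraps every CUDA call and aborts with file/line information on failure:

#include <stdio.h>
#include <stdlib.h>

// abort with a readable message if a CUDA call does not return cudaSuccess
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// usage:
// CUDA_CHECK(cudaMalloc((void**)&A_d, size));
// CUDA_CHECK(cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice));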
Kernel functions fn<<<...>>>()
• launching a kernel = a grid of threads is launched
• all threads execute the same code: single-program multiple-data (SPMD)
• threads are hierarchically organized: the grid consists of thread blocks, each block consists of threads
• up to 1024 threads can be in a thread block
Kernel Coordinates
• built-in variables available inside the kernel: blockIdx, threadIdx
• these "coordinates" allow threads (all executing the same code) to
identify what to do (e.g. which portion of the data to process)
• each thread can be uniquely identified by threadIdx & blockIdx
• telephone system analogy: think of blockIdx as the area code and
threadIdx as the local phone number
• built-in blockDim tells us the number of threads in a block
• for vector addition we can calculate the array index of the thread
`int i = blockIdx.x * blockDim.x + threadIdx.x;`
Threads execute the same kernel code
__global__ & __host__
• declare a kernel function with __global__
• calling a __global__ function -> launches a new grid of CUDA threads
• functions declared with __device__ can be called from within a CUDA thread
• if both __host__ & __device__ are used in a function declaration, CPU & GPU versions will be compiled (see the sketch below)
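A tiny illustration of the three qualifiers (my own example, not from the book):

// __device__: callable only from GPU code (from a kernel or another __device__ function)
__device__ float square(float x) { return x * x; }

// __host__ __device__: compiled twice, callable from both CPU and GPU code
__host__ __device__ float lerp(float a, float b, float t) { return a + t * (b - a); }

// __global__: a kernel; calling it launches a new grid of threads
__global__ void squareKernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(in[i]);
}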
Vector Addition Example
• general strategy: replace loop by grid of threads!
• data sizes might not be perfectly divisible by block sizes: always check bounds
• prevent threads of the boundary block from reading/writing outside the allocated memory

// compute vector sum C = A + B
// each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) { // check bounds
        C[i] = A[i] + B[i];
    }
}
Calling Kernels
• kernel configuration is specified between `<<<` and
`>>>`
• number of blocks, number of threads in each block
dim3 numThreads(256);
dim3 numBlocks((n + numThreads.x - 1) / numThreads.x); // ceiling division
vecAddKernel<<<numBlocks, numThreads>>>(A_d, B_d, C_d, n);

• we will learn about additional launch parameters (shared-mem size, cudaStream) later; a full host-side sketch follows below
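Putting the pieces from the previous slides together, a hedged host-side sketch of the full vecAdd flow (my own assembly of the book's fragments; error checks omitted for brevity):

void vecAdd(const float* A_h, const float* B_h, float* C_h, int n) {
    size_t size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. allocate device memory
    cudaMalloc((void**)&A_d, size);
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);

    // 2. copy inputs host -> device
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // 3. launch the kernel: one thread per element
    dim3 numThreads(256);
    dim3 numBlocks((n + numThreads.x - 1) / numThreads.x);
    vecAddKernel<<<numBlocks, numThreads>>>(A_d, B_d, C_d, n);

    // 4. copy result device -> host (waits for the kernel to finish)
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);

    // 5. free device memory
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}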
Compiler
• nvcc (the NVIDIA CUDA compiler) is used to compile kernels into PTX
• Parallel Thread Execution (PTX) is a low-level VM & instruction set
• the graphics driver translates PTX into executable binary code (SASS)
Ch 3: Multidimensional grids and data
• CUDA grid: 2-level hierarchy: blocks, threads
• idea: map threads to multi-dimensional data
• all threads in a grid execute the same kernel
• threads in the same block can access the same shared memory
• max block size: 1024 threads
• built-in 3D coordinates of a thread: blockIdx, threadIdx - identify which portion of the data to process
• shape of grid & blocks:
  • gridDim: number of blocks in the grid
  • blockDim: number of threads in a block
Grid continued
• grid can be different for each kernel launch, e.g.
dependent on data shapes
• typical grids contain thousands to millions of threads
• simple strategy: one thread per output element (e.g. one
thread per pixel, one thread per tensor element)
• threads can be scheduled in any order
• can use fewer than 3 dims (set others to 1)
• e.g. 1D for sequences, 2D for images etc.
dim3 grid(32, 1, 1);
dim3 block(128, 1, 1);
kernelFunction<<<grid, block>>>(...);
// number of threads: 128 * 32 = 4096
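For 2D data such as images, the same pattern with a hedged 2D configuration (ceiling division so the grid covers the image even when width and height are not multiples of the block size; rgbToGrayKernel and the _d pointers are the hypothetical names from the earlier sketch):

dim3 block(16, 16, 1);                       // 16 * 16 = 256 threads per block
dim3 grid((width  + block.x - 1) / block.x,  // ceil(width  / 16) blocks in x
          (height + block.y - 1) / block.y,  // ceil(height / 16) blocks in y
          1);
rgbToGrayKernel<<<grid, block>>>(gray_d, rgb_d, width, height);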
Built-in Variables
• built-in variables inside kernels:

blockIdx  // dim3 block coordinate
threadIdx // dim3 thread coordinate
blockDim  // number of threads in a block
gridDim   // number of blocks in a grid

• (blockDim & gridDim have the same values in all threads)
nd-Arrays in Memory
(figure: logical 4x4 view of elements (0,0)...(3,3) vs. the actual flat 1D layout in memory)
• the memory of multi-dim arrays is flat (1D) under the hood
• a 2D array can be linearized in different ways, e.g. for the 3x3 matrix
    A B C
    D E F
    G H I
  - row-major order:    A B C D E F G H I
  - column-major order: A D G B E H C F I
• torch tensors & numpy ndarrays use strides to specify how elements are laid out in memory (see the sketch below)
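As a quick sketch (mine) of the two linearizations for the element at (row, col) of a rows x cols array:

// row-major (C/C++, torch and numpy defaults): elements of a row are contiguous
int idxRowMajor = row * cols + col;

// column-major (Fortran, cuBLAS convention): elements of a column are contiguous
int idxColMajor = col * rows + row;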
Image blur example (3.3, p. 60)
• mean filter example: blurKernel
• each thread writes one output element, reads multiple input values
• single plane (channel) in the book, can easily be extended to multi-channel
• shows row-major pixel memory access (in & out pointers)
• keeps track of how many pixel values are summed
• handles boundary conditions (lines 5 & 25 of the book's listing); a hedged sketch follows below
Handling Boundary Conditions
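A hedged sketch along the lines of the book's blurKernel (single channel; BLUR_SIZE pixels in each direction; neighbours outside the image are simply skipped, which also handles the boundary cases):

#define BLUR_SIZE 1 // 1 -> 3x3 averaging window

__global__
void blurKernel(unsigned char* out, const unsigned char* in, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {
        int pixVal = 0;
        int pixels = 0; // how many pixel values were actually summed
        for (int dr = -BLUR_SIZE; dr <= BLUR_SIZE; ++dr) {
            for (int dc = -BLUR_SIZE; dc <= BLUR_SIZE; ++dc) {
                int curRow = row + dr;
                int curCol = col + dc;
                // only accumulate neighbours that lie inside the image
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    pixVal += in[curRow * w + curCol];
                    pixels++;
                }
            }
        }
        out[row * w + col] = (unsigned char)(pixVal / pixels);
    }
}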
Matrix Multiplication
• staple of science & engineering (and deep learning)
• compute inner-products of rows & columns
• Strategy: 1 thread per output matrix element
• Example: multiplying square matrices (rows == cols); see the sketch below
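A hedged sketch of the one-thread-per-output-element strategy for square N x N matrices in row-major layout (my own, not the book's listing):

__global__
void matMulKernel(float* C, const float* A, const float* B, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        // inner product of row `row` of A with column `col` of B
        for (int k = 0; k < N; ++k) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}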
