GPGPU
Introduction
Alan Gray
EPCC
The University of Edinburgh
What is a GPU

GPU vs CPU: different philosophies
[Figure: on a CPU, a compute unit is a core; on a GPU, a compute unit is an SM (Streaming Multiprocessor) of 32 CUDA cores]
NVIDIA HPC GPU Solutions

Model         FP32 [TFlops]  Cores  RAM [GB]     Link                  Bandwidth [GB/s]
Kepler K40    4.3            2880   12 GDDR5     PCIe 3.0 (15.8 GB/s)  240
Pascal P100   10.6           3584   16 HBM2      PCIe 3.0 (15.8 GB/s)  720
Volta V100    15.7           5120   16/32 HBM2   PCIe 3.0 (15.8 GB/s)  900
Ampere A100   19.5           8192   40 HBM2      PCIe 4.0 (31.6 GB/s)  1500

• the Tesla series is NVIDIA's top-of-the-range GPU solution for HPC
• the GeForce series is aimed at gaming
• HBM2: second-generation high-bandwidth RAM interface for 3D-stacked DRAM with error correction (ECC)
AMD HPC GPU Solutions

Model         FP32 [TFlops]  Cores  RAM [GB]   Link                  Bandwidth [GB/s]
Radeon MI8    8.2            4096   4 HBM      PCIe 3.0 (15.8 GB/s)  512
Radeon MI25   12.3           4096   16 HBM2    PCIe 3.0 (15.8 GB/s)  484
Radeon MI50   13.4           3840   16 HBM2    PCIe 3.0 (15.8 GB/s)  1024
Radeon MI100  23.1           7680   32 HBM2    PCIe 4.0 (31.6 GB/s)  1200

• the VEGA processors are AMD's top-of-the-range GPU solution for HPC
• the Radeon RX VEGA/500/400 series are aimed at gaming
• HBM2: second-generation high-bandwidth RAM interface for 3D-stacked DRAM with error correction (ECC)
• Infinity Fabric™ links per GPU deliver up to 200 GB/s of peer-to-peer bandwidth
• very high TDP, sustainable on HPC server blades
GPGPU Programming Model

General Purpose GPU Programming refers to using the GPU's computational power to solve problems other than graphics.

CPU and GPU are separate devices with separate memory address spaces.

The GPU is seen as an auxiliary coprocessor equipped with thousands of cores and a high-bandwidth memory.

They should work together for the best benefit and performance.

[Figure: CPU and GPU side by side, each with its own memory]
GPGPU Programming Model

serial parts of a program, or those with a low level of parallelism, keep running on the CPU (host)
compute-intensive, data-parallel regions are executed on the GPU (device)
the required data is moved to GPU memory and back to host memory
There cannot be a GPU
without a CPU
GPUs are designed as numeric
computing engines, therefore they
will not perform well on other tasks.
GPU Servers
• Several vendors
offer GPU Servers
• Example
Configuration:
– 4 GPUs plus 2
(multi-core) CPUs
Cray XK6 Compute Blade
Scaling to larger systems

[Figure: each node couples a CPU (with its DRAM) to a GPU (with its GDRAM) over PCIe; an interconnect allows multiple such nodes to be connected]
GPU Architecture Scheme

A typical GPU architecture consists of:

Main global memory
• medium size (8-16 GB)
• very high bandwidth (250-800 GB/s)

many Streaming Multiprocessor (SM) control units; each SM unit has streaming processors, i.e.
• many ALU cores (> 100 cores)
• lots of registers (32K-64K)
• instruction scheduler dispatchers
• a shared memory with very fast access to data
GPU Functional Unit Types

FP32: performs 32-bit floating point add, multiply, multiply-add, and similar instructions.
INT32: performs 32-bit integer add, multiply, multiply-add, and some logical operations.
FP64: executes 64-bit floating point operations.
Special Function Unit (SFU): performs reciprocal (1/x) and transcendental instructions such as sine, cosine, and reciprocal square root.
Load/Store (LS): performs loads and stores from the shared, constant, local, and global memory address spaces.
Tensor Core: specialized units that compute the A*B+C matrix product.
Nvidia SM
• Fewer scheduling units than cores
• Threads are scheduled in groups of 32, called a warp
• Threads within a warp always execute the same instruction in lock-step (on different data elements)
• Configurable L1 cache / shared memory
NVIDIA Volta V100 Architecture (2017)
https://developer.nvidia.com/blog/inside-volta

A full GV100 GPU contains 6 Graphics Processing Clusters (GPCs) with 14 SMs each, 84 SMs in total:
• 5376 FP32 cores
• 5376 INT32 cores
• 6 MB L2 cache

High Bandwidth Memory
• 16 GB HBM2 SDRAM
• 900 GB/s bandwidth

NVLink technology
• 300 GB/s bandwidth for host data transfers
• about 12x PCIe Gen3 x16

Peak performance: 15.7 FP32 TFlops
Max power consumption: 300 W
Streaming Multiprocessor of nVIDIA Volta (2017)

The SM is composed of 4 independent processing blocks; each block sports:
• 1 warp scheduler with 2 dispatch units
• 16 FP32 + 16 INT32 ALU units
• separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput
• 8 FP64 ALU units
• 2 Tensor Core units (HW matmul)
• 8 Load/Store units
• 4 SFU units
• 32768 32-bit registers

each block accesses:
• 128 KB of L1/shared memory
• 4 texture units
NVIDIA Ampere A100 Architecture (2020)
developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth

A full GA100 GPU contains 8 Graphics Processing Clusters (GPCs) with 16 SMs each, 128 SMs in total:
• 8192 FP32 cores
• 8192 INT32 cores
• 40 MB L2 cache

High Bandwidth Memory
• 40 GB HBM2
• 1555 GB/s bandwidth

NVLink technology
• 600 GB/s bandwidth for host data transfers
• about 24x PCIe Gen3 x16

Peak performance: 19.5 FP32 TFlops
Max power consumption: 400 W
Streaming Multiprocessor of nVIDIA Ampere (2020)

The SM is composed of 4 independent processing blocks; each block sports:
• 1 warp scheduler with 2 dispatch units
• 16 FP32 + 16 INT32 ALU units
• separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput
• 8 FP64 ALU units
• 2 Tensor Core units (HW matmul)
• 8 Load/Store units
• 4 SFU units
• 32768 32-bit registers

each block accesses:
• 192 KB of L1/shared memory
• 4 texture units
GPU nVIDIA K80 (2013)

Two GPUs (K40) per device
• 12 GB RAM per GPU
• 480 GB/s memory bandwidth
• 15 SMs per GPU
• 192 CUDA cores/SM, for a total of 2880 CUDA cores per GPU
AMD Radeon MI100 Architecture (2020)
www.amd.com MI100 microarchitecture

A full MI100 GPU contains a total of 120 Compute Units (the AMD analogue of SMs):
• 7680 FP32 cores
• 8 MB L2 cache

High Bandwidth Memory
• 32 GB HBM2
• 1200 GB/s bandwidth

AMD Infinity Fabric
• 500 GB/s bandwidth for host data transfers
• about 24x PCIe Gen3 x16

Peak performance: 23.0 FP32 TFlops
Max power consumption: 400 W
• To utilise a GPU, programs must
  – contain parts targeted at the host CPU (most lines of source code)
  – contain parts targeted at the GPU (key computational kernels)
  – manage data transfers between the distinct CPU and GPU memory spaces
  – traditional languages (e.g. C/Fortran) do not provide these facilities
Different worlds: host and device

Threading resources
• Host: 2 threads per core (SMT), 24/32 threads per node. The thread is the atomic execution unit.
• Device: e.g. 1536 (threads per SM) x 14 (SMs) = 21504 threads. The warp (32 threads) is the atomic execution unit.
Declarative languages
• OpenMP
  • v4.0+ allows offloading of tasks onto GPUs
• OpenACC
  • high-level model, particularly suited for devices such as GPUs

Languages
• CUDA
  • extension to C developed by NVIDIA. With the PGI compilers, a Fortran extension is also possible.
• OpenCL
  • general framework for writing programs across heterogeneous devices. Often used for non-NVIDIA GPUs and FPGAs.

Examples of language bindings: C - CUDA C; F# - Alea.cuBase
CUDA programming model

A CUDA program:
  serial sections of the code are performed by the CPU (host)
  the parallel ones (those that exhibit a rich amount of data parallelism) are performed by the GPU (device) in SIMD mode, as CUDA kernels
  host and device have separate memory spaces: programmers need to transfer data between CPU and GPU in a manner similar to "one-sided" message passing
CUDA: Compute Unified Device Architecture

CUDA is a general-purpose parallel computing platform and programming model that eases GPU programming, providing:
  a hierarchical multi-threaded programming paradigm that matches the GPU hardware structure
  extensions to higher-level programming languages (C/C++ and Fortran) to express thread parallelism within a familiar programming environment
  a new instruction set architecture called PTX (Parallel Thread eXecution) that matches typical GPU hardware
  a complete, mature SDK: compiler (nvcc), debugger (cuda-gdb), profiler (nvvp), IDE (Nsight Eclipse/Visual Studio plugins)
  a set of GPU-accelerated libraries for common scientific algorithms and requirements ...
CUDA GPU-ready scientific libraries

  dense/sparse linear algebra, single/multi-GPU: cuBLAS, nvBLAS, cuSPARSE
  dense/sparse direct solvers and factorizations: cuSOLVER
  Fast Fourier Transform (and related): cuFFT
  random number generation: cuRAND
  common primitives for digital signal processing and image processing: NPP (NVIDIA Performance Primitives)
  deep learning libraries
  ... and many, many more
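As an illustration of how such a library is called from host code, here is a minimal cuBLAS sketch (array sizes, values and the file name in the compile line are made up for the example) computing y = alpha*x + y on the GPU:

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 1024;
    const float alpha = 2.0f;
    float *x, *y;

    // managed memory keeps the sketch short: one pointer visible to CPU and GPU
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);                        // initialize the cuBLAS context
    cublasSaxpy(handle, n, &alpha, x, 1, y, 1);   // y = alpha*x + y, executed on the GPU
    cudaDeviceSynchronize();                      // wait before reading the result on the host

    printf("y[0] = %f\n", y[0]);                  // expected: 4.0
    cublasDestroy(handle);
    cudaFree(x);
    cudaFree(y);
    return 0;
}

Under these assumptions it would be compiled with something like: nvcc saxpy_demo.cu -lcublas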
GPU Accelerated Libraries
[Slide figure: overview of NVIDIA GPU-accelerated library families, including visual processing, video encoding, and image & video processing (NPP)]
CUDA - C
[Slide figure: applications can exploit the GPU through three approaches: GPU-accelerated libraries, compiler directives, and programming languages]
GPGPU Programming Model

A function which runs on a GPU is called a "kernel"
• when a kernel is launched on a GPU, thousands of threads will execute its code
• the programmer chooses the number of threads to run
• each thread acts on a different data element independently
• the GPU parallelism is very close to the SPMD paradigm

// CPU version
void vecAddCPU (int N, const float *a, const float *b, float *c)
{
    for ( int i = 0; i < N; i++ )
        c[i] = a[i] + b[i];
}
...
// call vecAddCPU on N elements
vecAddCPU ( N, a, b, c );

// GPU version
__global__ void vecAddGPU (int N, const float *a, const float *b, float *c)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < N ) c[i] = a[i] + b[i];
}
...
// call vecAddGPU on N elements
vecAddGPU<<<1, N>>>( N, a, b, c );
Asynchronous execution

By default, GPU operations are asynchronous.
• When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later.
• This allows us to execute more computations in parallel, including operations on the CPU or on other GPUs.

Operations are instead synchronous if:
• the environment variable CUDA_LAUNCH_BLOCKING is set to 1;
• a profiler (nvprof) is used without enabling concurrent kernel profiling;
• a memcpy involves host memory which is not page-locked.
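A minimal sketch of this behaviour (kernel and sizes are invented for the example): the first host-side timestamp is taken right after the launch, which returns immediately, while the second is taken only after cudaDeviceSynchronize(), i.e. after the kernel has really finished.

#include <stdio.h>
#include <sys/time.h>

__global__ void busyKernel(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; k++)
            v[i] = v[i] * 0.5f + 1.0f;   // some arbitrary work
}

int main(void) {
    int n = 1 << 20;
    float *d_v;
    cudaMalloc(&d_v, n * sizeof(float));

    struct timeval t0, t1, t2;
    gettimeofday(&t0, 0);
    busyKernel<<<(n + 255) / 256, 256>>>(d_v, n);   // asynchronous: returns immediately
    gettimeofday(&t1, 0);
    cudaDeviceSynchronize();                        // host blocks here until the kernel is done
    gettimeofday(&t2, 0);

    printf("launch only: %ld us, launch + sync: %ld us\n",
           (long)((t1.tv_sec - t0.tv_sec) * 1000000 + (t1.tv_usec - t0.tv_usec)),
           (long)((t2.tv_sec - t0.tv_sec) * 1000000 + (t2.tv_usec - t0.tv_usec)));

    cudaFree(d_v);
    return 0;
}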
A first program

#include <stdio.h>

void CPUFunction() {
    printf("Hello world from the CPU.\n");
}

__global__ void GPUFunction() {
    printf("Hello world from the GPU.\n");
}

int main() {
    CPUFunction();
    // launch configuration as referenced by the exercise hints below
    GPUFunction<<<1,1>>>();
    cudaDeviceSynchronize();
}

Try to remove one of the following at a time and see what happens:
• __global__
• <<<1,1>>>
• cudaDeviceSynchronize();
A second program

#include <stdio.h>

__global__ void GPUfunction()
{
    printf("This is running in parallel.\n");
}

int main()
{
    GPUfunction <<<5, 5>>>();
    cudaDeviceSynchronize();
}

Try:
• <<<1, 1>>>
• <<<1, 10>>>
• <<<10, 1>>>
• <<<10, 10>>>
• and, again, remove cudaDeviceSynchronize();

COMPILE with the NVIDIA C compiler:
nvcc -o second second.cu -run
GPU Thread Hierarchy

Threads are organized into blocks of threads; the blocks form a grid.
  blockDim: block dimensions, in thread units
  gridDim: grid dimensions, in block units
  threadIdx: thread coordinates inside a block
  blockIdx: block coordinates inside the grid

This idiomatic expression gives each thread a unique index within the entire grid:

    int i = blockIdx.x * blockDim.x + threadIdx.x;

for blockIdx.x = 0
    i = 0 * 32 + threadIdx.x = { 0, 1, 2, ... , 31 }
for blockIdx.x = 1
    i = 1 * 32 + threadIdx.x = { 32, 33, 34, ... , 63 }
for blockIdx.x = 2
    i = 2 * 32 + threadIdx.x = { 64, 65, 66, ... , 95 }

http://www.icl.utk.edu/~mgates3/docs/cuda.html
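A small sketch (kernel name and launch configuration chosen for illustration) that prints the built-in variables and the resulting global index:

#include <stdio.h>

__global__ void showIndex(void) {
    // unique index of this thread within the whole grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d/%d, thread %d/%d -> global index %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x, i);
}

int main(void) {
    showIndex<<<3, 4>>>();      // 3 blocks of 4 threads: global indices 0..11
    cudaDeviceSynchronize();    // wait so that the device-side printf output is flushed
    return 0;
}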
CUDA Programming Model

GPU threads are extremely lightweight
• no penalty in case of a context switch
• each thread has its own registers

The more threads are in flight, the better the GPU hardware can hide memory and computational latencies.
2D Example
• The previous examples were one dimensional.
• Each thread block can be 1D, 2D or 3D to best fit the algorithm, e.g. for matrix addition:

__global__ void matrixAdd(float a[N][N], float b[N][N], float c[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    c[i][j] = a[i][j] + b[i][j];
}

int main()
{
    dim3 blocksPerGrid(1);          /* 1 block per grid (1D) */
    dim3 threadsPerBlock(N, N);     /* NxN threads per block (2D) */
    matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
}

• dim3 is a CUDA type, containing 3 integers (x, y and z components)
Multiple Block 2D Example
• The grid can also be 1D, 2D or 3D

__global__ void matrixAdd(float a[N][N], float b[N][N], float c[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    c[i][j] = a[i][j] + b[i][j];
}

int main()
{
    dim3 blocksPerGrid(N/16, N/16);   // (N/16)x(N/16) blocks/grid (2D)
    dim3 threadsPerBlock(16, 16);     // 16x16 threads/block (2D)
    matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
}
...
The full example

#include <stdio.h>
#include <sys/time.h>

#define N (32 * 1024)

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}

int main( void ) {
    int *a, *b, *c, *dev_a, *dev_b, *dev_c;
    struct timeval t1, t2;

    dim3 threads(32);
    dim3 blocks ( (N+threads.x-1)/threads.x );

    a = (int*)malloc( N * sizeof(int) );              // the same for b and c
    cudaMalloc( (void**)&dev_a, N * sizeof(int) );    // the same for dev_b and dev_c

    for (int i=0; i<N; i++) { a[i] = i; b[i] = 2 * i; }
    gettimeofday(&t1, 0);

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );

    add<<<blocks,threads>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
    cudaDeviceSynchronize();
    gettimeofday(&t2, 0);

    printf( "We did it!\n" );

    double time = (1000000.0*(t2.tv_sec-t1.tv_sec) + t2.tv_usec-t1.tv_usec)/1000.0;
    printf("Time to generate: %3.1f ms \n", time);

    // free the memory
    cudaFree( dev_a );    // the same for dev_b and dev_c
    free( a );            // the same for b and c
    return 0;
}
CUDA 6.x - Unified Memory

Unified Memory creates a pool of memory with an address space that is shared between the CPU and GPU. In other words, a block of Unified Memory is accessible to both the CPU and GPU through a single pointer;

the system automatically migrates data allocated in Unified Memory between host and device memory
• no need to explicitly declare device memory regions
• no need to explicitly copy data back and forth between CPU and GPU devices
• greatly simplifies programming and speeds up CUDA ports

Note: it can result in performance degradation with respect to an explicit, finely tuned data transfer.
Explicit vs Unified Memory

// Explicit Memory Management
void *data, *d_data;
data = malloc(N);
cudaMalloc(&d_data, N);

cpu_func1(data, N);

cudaMemcpy(d_data, data, N, ...);
gpu_func2<<<...>>>(d_data, N);
cudaMemcpy(data, d_data, N, ...);
cudaFree(d_data);

cpu_func3(data, N);
free(data);

// GPU code w/ Unified Memory
void *data;
data = malloc(N);

cpu_func1(data, N);

gpu_func2<<<...>>>(data, N);
cudaDeviceSynchronize();

cpu_func3(data, N);
free(data);
Sample code using CUDA Unified Memory

// CPU version (end of routine)
    use_data(data);
    free(data);
}

// Unified Memory version (end of routine)
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);
}
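A minimal, self-contained sketch of the same idea using cudaMallocManaged (kernel, sizes and values are illustrative): one pointer is allocated, touched by the CPU, updated by a GPU kernel, and read again by the CPU after a synchronization.

#include <stdio.h>

__global__ void scale(float *data, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= f;                        // GPU writes directly to managed memory
}

int main(void) {
    int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));    // single pointer, visible to CPU and GPU

    for (int i = 0; i < n; i++) data[i] = 1.0f;     // CPU initializes the very same buffer

    scale<<<(n + 255) / 256, 256>>>(data, n, 3.0f);
    cudaDeviceSynchronize();                        // required before the CPU touches the data again

    printf("data[0] = %f\n", data[0]);              // expected: 3.0
    cudaFree(data);
    return 0;
}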
Checking CUDA Errors

All CUDA API calls return an error code of type cudaError_t
• the special value cudaSuccess means that no error occurred

The CUDA runtime has a convenience function that translates a CUDA error into a readable string with a human-understandable description of the type of error that occurred:

const char* cudaGetErrorString(cudaError_t code)

if (cerr != cudaSuccess)
    fprintf(stderr, "%s\n", cudaGetErrorString(cerr));

Asynchronous CUDA API calls return an error code which refers only to errors that may occur on the host during the call.
CUDA kernels are asynchronous and of void type, so they do not return any error code.
Checking Errors for CUDA kernels

The error status is also held in an internal variable, which is modified by each CUDA API call or kernel launch.
The CUDA runtime has a function that returns the status of this internal error variable:

cudaError_t cudaGetLastError(void)

1. Returns the status of the internal error variable (cudaSuccess or other)
2. Resets the internal error status to cudaSuccess

• The error code from cudaGetLastError may refer to any preceding CUDA runtime API call
• To check the error status of a CUDA kernel execution, we have to wait for kernel completion using the following synchronization API: cudaDeviceSynchronize()

// reset internal state
cudaError_t cerr = cudaGetLastError();
// launch kernel
kernelGPU<<<dimGrid,dimBlock>>>(...);
cudaDeviceSynchronize();
cerr = cudaGetLastError();
if (cerr != cudaSuccess)
    fprintf(stderr, "%s\n", cudaGetErrorString(cerr));
Checking CUDA Errors

Error checking is strongly encouraged during the development phase.
Error checking may introduce overhead and unpleasant synchronizations during production runs.
Error-checking code can become very verbose and tedious: a common approach is to define an assert-style preprocessor macro which can be turned on/off in a simple manner.

#define CUDA_CHECK(X) {\
    cudaError_t _m_cudaStat = X;\
    if (cudaSuccess != _m_cudaStat) {\
        fprintf(stderr,"\nCUDA_ERROR: %s in file %s line %d\n",\
                cudaGetErrorString(_m_cudaStat), __FILE__, __LINE__);\
        exit(1);\
    }}
...
CUDA_CHECK( cudaMemcpy(d_buf, h_buf, buffSize, cudaMemcpyHostToDevice) );
Development tools
Common
Memory Checker
Built-in profiler
Visual Profiler
Linux
CUDA GDB
Parallel Nsight for Eclipse
Windows
Parallel Nsight for VisualStudio
Profiling: Visual Profiler
Traces execution at host, driver and kernel levels (unified
timeline)
Supports automated analysis (hardware counters)
Parallel NSight
https://developer.nvidia.com/tools-overview
• nvidia-smi
– Shows which GPUs are available and gives information about them
– Can be used in scrolling mode when running CUDA programs
• nvprof
– Quick profiler, useful for showing memory transfers between host
and device.
– More sophisticated profiling can be done with nvvp.
• cuda-memcheck
– Ideal for spotting memory leaks in the CUDA program. Will
considerably slow execution.
• cuda-gdb
– CUDA debugger
more on the GPU Execution Model

Software-to-hardware mapping: a thread runs on a GPU core, a thread block runs on a Streaming Multiprocessor, a grid runs on the whole GPU.

when a GPU kernel is invoked:
  each thread block is assigned to an SM in a round-robin mode
  • a maximum number of blocks can be assigned to each SM, depending on the hardware generation and on how many resources each block requires (registers, shared memory, etc.)
  • the runtime system maintains a list of active blocks and assigns new blocks to SMs as they complete
  • once a block is assigned to an SM, it remains on that SM until the work of all its threads is completed
  • each block executes independently from the others (no synchronization is possible among them)

  threads of each block are partitioned into warps of consecutive threads

  the scheduler selects for execution a warp from one of the resident blocks in each SM

  a warp executes one common set of instructions at a time
  • each GPU core takes care of one thread in the warp
  • full efficiency is reached when all threads agree on their execution path
CUDA and NVIDIA GPUs
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

It often makes sense to set threads/block = 1024 and make the number of blocks = problem_size/1024 (rounded up).
Warps
The GPU multiprocessor creates, manages, schedules, and executes threads in
groups of 32 parallel threads called warps.
Individual threads composing a warp start together at the same program address,
but they have their own instruction address counter and register state and are
therefore free to branch and execute independently
The SM warp scheduler
The NVIDIA SM schedules threads in groups of 32 threads, called warps
Using 2 warp schedulers per SM allows two warps to be issued and
executed concurrently if hardware resources are available
Warps
• A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their
execution path.
• If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken,
disabling threads that are not on that path, and when all paths complete, the threads converge back to the same
execution path.
• Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are
executing common or disjointed code paths.
• Each single instruction in a warp is performed in lockstep. The next instruction can be fetched only when the previous one has completed.
• An SM statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues
one instruction for one of its assigned warps (half and quarter-warp) that is ready to execute, if any.
• Volta is equipped with 4 warp-scheduler units. Instructions are performed over two cycles, and the schedulers can issue
independent instructions every cycle. Dependent instruction issue latency for core FMA math operations are reduced to
four clock cycles, so execution latencies of core math operations can be hidden by as few as 4 warps per SM, assuming 4-
way instruction-level parallelism ILP per warp. Many more warps are, of course, recommended to cover the much greater
latency of memory transactions and control-flow operations.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture
MANY DETAILS HERE http://taylorlloyd.ca/gpu,/pascal,/cuda/2017/01/07/gpu-pipelines.html
Volta SM Warp Scheduler

The Volta SM has 4 warp schedulers.
Each scheduler is responsible for feeding:
• 32 CUDA cores
• 8 load/store units
• 8 Special Function Units
Instruction Execution

Example: a single Volta processing block has 16 FP32/INT32 and 8 FP64 ALU units, and a CUDA warp is 32 threads wide:
  an FP32 operation on a warp will execute in 32 threads / 16 FP32 ALUs = 2 cycles
  an FP64 operation on a warp will execute in 32 threads / 8 FP64 ALUs = 4 cycles

Each arithmetic operation is pipelined, so that as soon as one warp has entered the first stage, a second independent warp can push its operands into the pipeline.

FMA operations have a four-clock-cycle latency on Volta: execution latencies of core FMA math operations can be hidden by as few as 4 warps per SM, assuming 4-way instruction-level parallelism (ILP) per warp.
Hiding Latencies

What is latency?
• the number of clock cycles needed to complete an instruction
• ... that is, the number of cycles to wait before another dependent operation can start
  arithmetic latency (~ 18-24 cycles)
  memory access latency (~ 400-800 cycles)

We cannot remove latencies (they are an effect of the hardware design), but we can lessen their effect and hide them:
• saturating the computational pipelines in compute-bound problems
• saturating the bandwidth in memory-bound problems

We can organize our code so as to provide the scheduler with a sufficient number of independent operations: the more warps are available, the more context switches can hide latencies and let execution proceed with other useful operations.

There are two possible ways and paradigms to use (they can be combined, too!)
• Thread-Level Parallelism (TLP)
• Instruction-Level Parallelism (ILP)
Thread-Level Parallelism (TLP)

Strive for high SM occupancy: try to provide as many threads per SM as possible, so that the scheduler can easily find a warp ready to execute while the others are still busy.

This kind of approach is effective when there is a low level of independent operations per CUDA kernel.
Instruction-Level Parallelism (ILP)

Strive for multiple independent operations inside your CUDA kernel: that is, let your kernel act on more than one data element, as in the sketch below.

This allows the scheduler to stay on the same warp and fully load each hardware pipeline.
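A minimal sketch of the ILP idea (kernel and launch figures invented for illustration): each thread handles two independent elements, so the two loads and the two additions can be overlapped by the scheduler within the same warp.

__global__ void vecAdd2(int n, const float *a, const float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i + blockDim.x * gridDim.x;     // second, independent element

    // two independent load/add/store chains per thread:
    // the second can be issued while the first is still in flight
    if (i < n) c[i] = a[i] + b[i];
    if (j < n) c[j] = a[j] + b[j];
}

// launched with half as many threads as elements, e.g.
// vecAdd2<<< (n/2 + 255)/256, 256 >>>(n, d_a, d_b, d_c);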
Branching example
• E.g. you want to split your threads into 2 groups:

    i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i%2 == 0)
        …
    else
        …
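With the condition above, consecutive threads take different branches, so every warp diverges and serially executes both paths. A commonly suggested alternative (a sketch, the index math is illustrative) is to split the work on warp boundaries, so that all 32 threads of a warp take the same path:

    i = blockIdx.x*blockDim.x + threadIdx.x;
    if ((i / warpSize) % 2 == 0)
        …   // even-numbered warps: the whole warp takes this path, no divergence
    else
        …   // odd-numbered warps: again, the whole warp agrees on its path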
Memory coalescing
• Global memory bandwidth for graphics memory on GPU is high
compared to CPU
– But there are many data-hungry cores
– Memory bandwidth is a bottleneck
• Maximum bandwidth achieved when data is loaded for multiple
threads in a single transaction: coalescing
• This will happen when data access patterns meet certain
conditions: 16 consecutive threads (half-warp) must access data
from within the same memory segment
• E.g. condition met when consecutive threads read consecutive
memory addresses within a warp.
• Otherwise, memory accesses are serialised, significantly degrading
performance
• Adapting code to allow coalescing can dramatically improve
performance
Global Memory Load/Store
// strided data copy
__global__ void strideCopy (int N, float *odata, float* idata, int stride) {
int xid = (blockIdx.x*blockDim.x + threadIdx.x) * stride;
if (xid < N) odata[xid] = idata[xid];
}
Measured on an M2070; total elements = 16776960; blocks used = 65535; block length = 256
Data alignment in Global Memory

It is very important to align data in memory so as to have aligned (coalesced) accesses during load/store operations in global memory, reducing the number of segments moved across the bus
• cudaMalloc() guarantees the alignment of the first element in global memory, useful for one-dimensional arrays
• cudaMallocPitch() must be used to allocate 2D buffers
  elements are padded so that each row is aligned for coalesced accesses
  returns an integer (pitch, in bytes) which can be used as a stride to access row elements

// host code
int width = 64, height = 64;
size_t pitch;
float *devPtr;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);

// device code
__global__ void myKernel(float *devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; r++) {
        float *row = (float *)((char *)devPtr + r * pitch);   // pitch is in bytes
        for (int c = 0; c < width; c++) {
            float element = row[c];
        }
    }
    ...
}
Cache Hierarchy for Global Memory Accesses

GPU designs include a cache hierarchy in order to exploit spatial and temporal data locality.
Set cache configuration
cudaDeviceSetCacheConfig ( cudaFuncCache cacheConfig )
Description:
On devices where the L1 cache and shared memory use the same hardware
resources, this sets through cacheConfig the preferred cache configuration for the
current device. This is only a preference. The runtime will use the requested
configuration if possible, but it is free to choose a different configuration if
required to execute the function.
Any function preference set via cudaFuncSetCacheConfig () will be preferred over
this device-wide setting. Launching a kernel with a different preference than the
most recent preference setting, may insert a device-side synchronization point.
The supported cache configurations are:
• cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
• cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
• cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory
• cudaFuncCachePreferEqual: prefer equal size L1 cache and shared memory
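For example (the kernel name is illustrative), the preference can be set device-wide and then overridden for a single kernel:

// prefer a larger shared memory partition for the whole device
cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

// override the preference for one specific kernel
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

myKernel<<<grid, block>>>(...);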
Cache Hierarchy for Global Memory Accesses

There is just one type of store operation: when data has to be updated in global memory, its L1 copy is invalidated and the L2 cache value is updated.

Example: a warp requests 32 consecutive 4-byte words that are not aligned to a segment (128 bytes in total)
• Caching load: the addresses fall in 2 cache-line segments; 256 bytes are moved over the bus; bus utilization: 50%
• Non-caching load: the addresses fall in 5 32-byte segments; 160 bytes are moved over the bus; bus utilization: 80%
Shared Memory

The Shared Memory is a small but quite fast memory located on each SM.
Shared Memory Allocation

// CUDA C: statically, inside the kernel
__global__ void myKernelOnGPU (...) {
    ...
    __shared__ type shmem[MEMSZ];
    ...
}

or using dynamic allocation:

// dynamically sized
extern __shared__ type dynshmem[];

__global__ void myKernelOnGPU (...) {
    ...
    dynshmem[i] = ... ;
    ...
}

void myHostFunction() {
    ...
    myKernelOnGPU<<<gs,bs,MEMSZ>>>();
}

! CUDA Fortran: statically, inside the kernel
attributes(global) subroutine myKernel(...)
    ...
    type, shared :: variable_name
    ...
end subroutine

! or dynamically sized
type, shared :: dynshmem(*)

attributes(global) subroutine myKernel(...)
    ...
    dynshmem(i) = ...
    ...
end subroutine

variables allocated in shared memory have the storage duration of the kernel launch (not persistent!)
they are only accessible by the threads of the same block
Thread Block Synchronization

All threads in the same block can be synchronized using the CUDA runtime API call:

__syncthreads()   |   call syncthreads()

which blocks execution until all other threads of the block reach the same call location.

It can be used in conditional code too, but only if all threads in the block reach the same synchronization call; "... otherwise the code execution is likely to hang or produce unintended side effects".
Using Shared Memory for Thread Cooperation

Threads belonging to the same block can cooperate using the shared memory to share data:
• if a thread needs some data which has already been retrieved by another thread in the same block, this data can be shared through the shared memory
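A classic sketch of this cooperation (block size fixed to 64 for illustration): each thread stores one element into shared memory, the block synchronizes, and then every thread reads back an element that was loaded by a different thread, reversing the array within the block.

__global__ void staticReverse(int *d, int n) {
    __shared__ int s[64];        // one element per thread of the block
    int t  = threadIdx.x;
    int tr = n - t - 1;          // mirrored position
    s[t] = d[t];                 // each thread loads "its" element into shared memory
    __syncthreads();             // wait until the whole block has filled s[]
    d[t] = s[tr];                // read an element that another thread loaded
}

// launched with one block of n = 64 threads, e.g.
// staticReverse<<<1, 64>>>(d_array, 64);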
Constant Memory

Constant Memory is the ideal place to store constant, read-only data.

Specifications:
• size: 64 KB
• throughput: 32 bits per warp every 2 clock cycles
Accessing Constant Memory

Suppose a kernel is launched with 320 warps per SM and all threads request the same data.

If the data is in global memory:
• all warps will request the same segment from global memory
• the first time, the segment is copied into the L2 cache
• if other data passes through L2, there are good chances it will be evicted
• there are good chances that the data will be requested up to 320 times

If the data is in constant memory:
• during the first warp request, the data is copied into the constant cache
• since there is less traffic in the constant cache, there are good chances all other warps will find the data already in cache, so no more traffic on the bus
Constant Memory Allocation

__constant__ type variable_name;   // static

// warning: cannot be dynamically allocated
! warning: cannot be dynamically allocated (CUDA Fortran)
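A minimal sketch (array name, size and kernel are invented for the example): the constant buffer is declared at file scope, filled from the host with cudaMemcpyToSymbol, and then read by every warp through the constant cache.

__constant__ float coeff[16];                  // resides in constant memory

__global__ void applyCoeff(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // all threads of a block read the same address: the constant cache
    // serves the value as a broadcast, with no extra global memory traffic
    if (i < n) v[i] *= coeff[blockIdx.x % 16];
}

// host side:
// float h_coeff[16] = { ... };
// cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
// applyCoeff<<<(n + 255)/256, 256>>>(d_v, n);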
Registers

The maximum number of registers per thread can be limited at compile time with:
--maxregcount max_registers

• the number of active blocks per kernel can be forced using the CUDA special qualifier __launch_bounds__

__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
my_kernel( ... ) { ... }
Local Memory

Local Memory does not correspond to a real physical memory location.
Automatic variables are often placed in local memory by the compiler:
• large structures or arrays that would consume too much register space
If a kernel uses more registers than are available (register spilling), the compiler will move variables into local memory.
Local memory is mapped to global memory
• using the same caching hierarchies (L1 for read-only variables)
• facing the same latency and bandwidth limitations as global memory
To obtain information on how much local, constant and shared memory, and how many registers, are required by each kernel, you can pass the following compiler option:
--ptxas-options=-v
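For example (the source file name is illustrative):

nvcc -arch=sm_70 --ptxas-options=-v -c mykernel.cu

For each kernel, ptxas then prints the number of registers used and the bytes of shared, constant and local (spill) memory.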
Vector Sum 1. identify the data-parallel loop

// z = u + v (C)
for (i=0; i<N; i++)
    z[i] = u[i] + v[i];

! z = u + v (Fortran)
do i = 1,N
    z(i) = u(i) + v(i)
end do
Vector Sum 2. translate the identified data-parallel portions into CUDA kernels

Each thread executes the same kernel, but acts on different data:
• turn the loop into a CUDA kernel function
• map each CUDA thread onto a unique index to access data
• let each thread retrieve, compute and store its own data using that unique index
• prevent out-of-bounds access if the data size is not a multiple of the thread block size

// z = u + v
for (i=0; i<N; i++)
    z[i] = u[i] + v[i];

__global__ void gpuVectAdd (int N, const double *u, const double *v, double *z)
{
    // index is a unique identifier of each GPU thread
    int index = blockIdx.x * blockDim.x + threadIdx.x ;
    if (index < N)
        z[index] = u[index] + v[index];
}
The __global__ qualifier
declares this function to be a CUDA kernel
CUDA kernels are special C functions:
• can be called from host only
• must be called using the execution configuration syntax
• the return type must be void
• they are asynchronous: control is returned immediately to the
host code
• an explicit synchronization is needed in order to be sure that a
CUDA kernel has completed the execution
CUDA kernels

__global__ void add( int *a, int *b, int *c ) {
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    if (index < N)
        c[index] = a[index] + b[index];
}

In CUDA Fortran, the device attribute needs to be used when declaring a GPU array. The array can then be allocated by using the Fortran statement allocate.
Vector Sum 3. manage memory transfers and kernel calls

CUDA C API:
cudaMemcpy(void *dst, const void *src, size_t size, cudaMemcpyKind direction)
• copies size bytes from the src buffer to the dst buffer
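A sketch of the complete transfer sequence for the vector sum (the buffer names match the kernel call shown on the next slide; error checking is omitted):

double *u_dev, *v_dev, *z_dev;
cudaMalloc((void **)&u_dev, N * sizeof(double));
cudaMalloc((void **)&v_dev, N * sizeof(double));
cudaMalloc((void **)&z_dev, N * sizeof(double));

// host -> device: input vectors
cudaMemcpy(u_dev, u, N * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(v_dev, v, N * sizeof(double), cudaMemcpyHostToDevice);

gpuVectAdd<<<numBlocks, numThreads>>>(N, u_dev, v_dev, z_dev);

// device -> host: result vector
cudaMemcpy(z, z_dev, N * sizeof(double), cudaMemcpyDeviceToHost);

cudaFree(u_dev);
cudaFree(v_dev);
cudaFree(z_dev);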
Vector Sum 3. manage memory transfers and kernel calls

Kernel launch syntax: kernelCUDA<<<numBlocks,numThreads>>>(...)

dim3 numThreads(32);
dim3 numBlocks( ( N + numThreads.x - 1 ) / numThreads.x );
gpuVectAdd<<<numBlocks, numThreads>>>( N, u_dev, v_dev, z_dev );
OpenCL
• See http://www.khronos.org/opencl/
OpenACC History

OpenACC is a high-level specification with compiler directives for expressing parallelism for accelerators.
– portable to a wide range of accelerators
– improves portability

OpenACC 2.5 released in November 2015
OpenACC 2.6 released in November 2017
OpenACC has more than 10 member organizations
OpenACC Info & Vendors

http://www.openacc.org

The novelties in OpenACC 2.0 are significant
• OpenACC 1.0 was maybe not very mature...
Some changes are inspired by the development of the CUDA programming model
• but the standard is not limited to NVIDIA GPUs: one of its pros is the interoperability between platforms

Standard implementation
• CRAY provides full OpenACC 2.0 support in CCE 8.2
• PGI support for OpenACC 2.5 is almost complete (starting from version 15.1)
  • support for OpenACC 2.0 starting from 14.1
• GNU implementation effort is ongoing (there is a partial implementation in the 5.1 release and a dedicated branch for the 7.1 release)

We will focus on the PGI compiler
• 30-day trial license useful for testing
• all-in-one compiler, easy usage
• sometimes the compiler tries to help you...
• but it is also a constraint on the compiler to use
OpenACC – Simple, Powerful, Portable

main()
{
    <serial code>

    #pragma acc kernels
    //automatically runs on GPU
    {
        <parallel code>
    }
}

1. Simple:
   • simple compiler directives
   • directives are the easy path to accelerate compute-intensive applications
   • the compiler parallelizes the code
2. Open:
   • OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
3. Portable:
   • works on many-core GPUs and multi-core CPUs
4. Powerful:
   • GPU directives allow complete access to the massive parallel power of a GPU
OpenMP 4.0/4.5 alternative

OpenMP 4.0/4.5 supports heterogeneous systems (accelerators/devices).
What's new in OpenMP 4.x to support the accelerator model (a minimal example follows the list):
– Target regions
  • structured and unstructured target data regions
    – omp target [clause[[,] clause],…]
    – omp declare target
  • asynchronous execution (nowait) and data dependency (depend)
– Manage device data environment
  • data mapping APIs
    – map([map-type:] list)
  • data regions
    – omp target data [clause[[,] clause], …]
    – omp target enter/exit data [clause[[,] clause], …]
– Parallelism & worksharing for devices
  • omp teams [clause[[,] clause],…]
  • omp distribute [clause[[,] clause],…]
– SIMD parallelism
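A minimal sketch of the OpenMP 4.x accelerator model applied to the vector sum used earlier (the clause choices are illustrative):

void vecAdd(int n, const double *u, const double *v, double *z)
{
    // offload the loop to the device, mapping inputs in and the result out
    #pragma omp target teams distribute parallel for \
            map(to: u[0:n], v[0:n]) map(from: z[0:n])
    for (int i = 0; i < n; i++)
        z[i] = u[i] + v[i];
}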
Familiar to OpenMP Programmers

// CPU version (OpenMP)
main() {
    double pi = 0.0; long i;
    #pragma omp parallel for reduction(+:pi)
    for (i=0; i<N; i++)
    {
        double t = (double)((i+0.05)/N);
        pi += 4.0/(1.0+t*t);
    }
}

// GPU version (OpenACC)
main() {
    double pi = 0.0; long i;
    #pragma acc parallel loop reduction(+:pi)
    for (i=0; i<N; i++)
    {
        double t = (double)((i+0.05)/N);
        pi += 4.0/(1.0+t*t);
    }
}