
Using CUDA

Oswald Haan
ohaan@gwdg.de
Code Examples:

cp -r ~ohaan/cuda_kurs/* .
cp -r ~ohaan/cuda_kurs_f/* .
A first Example: Adding two Vectors y = a*x+y

Host code for saxpy routine:

void saxpy( int n, float a, float *x, float *y )
{
  int i;
  for (i=0; i<n; i++) {
    y[i] = a*x[i] + y[i];
  }
}

Device code for saxpy_d kernel routine:

__global__ void saxpy_d( int n, float a, float *x, float *y )
{
  int i = threadIdx.x;
  if (i<n) {
    y[i] = a*x[i] + y[i];
  }
}
A first Example: Adding two Vectors

Host code for calling the sequential routine:

int main( void ) {
  int N = 1024, i;
  float *x, *y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  for(i=0; i<N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  saxpy( N, 2.0f, x, y );

  free(x); free(y);
}

Host code for calling the kernel routine:

int main( void ) {
  int N = 1024, i;
  float *x, *y, *x_d, *y_d;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));
  cudaMalloc( &x_d, N*sizeof(float));
  cudaMalloc( &y_d, N*sizeof(float));

  for(i=0; i<N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(x_d, x, sizeof(float)*N, cudaMemcpyHostToDevice);
  cudaMemcpy(y_d, y, sizeof(float)*N, cudaMemcpyHostToDevice);

  saxpy_d<<<1,N>>>( N, 2.0f, x_d, y_d );

  cudaMemcpy(y, y_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

  cudaFree(x_d); cudaFree(y_d);
  free(x); free(y);
}

code in
~ohaan/cuda_kurs/saxpy.cu
Managing Memory on Host and on Device
a = (float*)malloc(3*sizeof(float));
Allocates memory for three floats at address a in host memory
cudaMalloc( &a_d, 3*sizeof(float) );
Allocates memory for three floats at address a_d in device memory
Stores this address at address &a_d in host memory
[Diagram: the host array occupies cells a[0], a[1], a[2] at addresses a, a+1, a+2 in host
memory; the device array occupies cells a_d[0], a_d[1], a_d[2] at addresses a_d, a_d+1,
a_d+2 in device memory; the value a_d (the address of the device array) is itself stored
at address &a_d in host memory.]
cudaMemcpy(a_d, a, 3*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(a, a_d, 3*sizeof(float), cudaMemcpyDeviceToHost);

Arguments: destination address, source address, size of data to be copied


Using Unified Memory (compute capability >= 3.0, CUDA version >= 6.0)
int main(void) {
  int N = 1024, i;
  float *x, *y;

  cudaMallocManaged( &x, sizeof(float)*N );   // allocates memory accessible from
  cudaMallocManaged( &y, sizeof(float)*N );   // host and from device (unified memory)

  for (i=0; i<N; i++) {
    x[i] = 1.0f;                              // initializes data in host memory
    y[i] = 2.0f;
  }

  saxpy_d<<<1,N>>>( N, 2.0f, x, y );
  cudaDeviceSynchronize();                    // synchronization is necessary, because the host
                                              // must not access unified memory until the device
                                              // is inactive

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f)); // reads data written by the device
  printf("Max error: %f\n", maxError);
  cudaFree(x); cudaFree(y);
}

Host code
code in
~ohaan/cuda_kurs/saxpy_um.cu
Unified Memory with Static Allocation
const int N=1024;
__device__ __managed__ float x[N], y[N];

__global__
void saxpy_d( float a ) {
y[threadIdx.x] = a*x[threadIdx.x] + y[threadIdx.x];
}

int main(void) {
int i;

for (i=0; i<N; i++) {


x[i] = 1.0f;
y[i] = 2.0f;
}

saxpy_d<<<1,N>>>(2.0f);
cudaDeviceSynchronize();

float maxError = 0.0f;


for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
}

code in
~ohaan/cuda_kurs/saxpy_um_static.cu
Compiling CUDA codes
• CUDA source files must have the extension .cu
• Compiler nvcc is provided in the CUDA toolkit
• CUDA toolkit is available on GWDG's cluster frontends
gwdu101, gwdu102
by loading the CUDA toolkit module:
module load cuda
or sourcing the prepared command file
. module.x
• Compiling CUDA source file saxpy.cu with
nvcc -arch=sm_<xx> saxpy.cu -o saxpy
produces executable saxpy,
where xx is the compute capability of the GPU to be used;
xx = 52 is the default if the option -arch is not set.
With -arch=all, code for all possible compute capabilities will be generated.
Execution environment for CUDA executables
• GWDG’s compute cluster is operated by the workload manager Slurm, which
provides commands for allocating resources, for submitting jobs and for enquiring the
status of the cluster and of jobs
• All nodes with GPU devices belong to the Slurm partition gpu
• The Slurm sinfo command provides a list of nodes with types and numbers of gpus:
> sinfo -p gpu --format=%N,%G
NODELIST,GRES
dge[008-015],gpu:gtx980:4
dge[001-007],gpu:gtx1080:2
dte[001-010],gpu:k40:2
agt[001-002],gpu:v100:8
agq[001-012],gpu:rtx5000:4
General Slurm options
-p|--partition=gpu allocates resources providing gpus
--reservation=gpu-course allocates resources reserved for this course
-N,--nodes=<n> allocates <n> nodes
-n|--ntasks=<n> allocates resources for starting <n> tasks
--tasks-per-node=<n> allocates resources for starting <n> tasks per node.
(If used with -n , it denotes the maximum number of tasks per node)
-t|--time= <hh:mm:ss> maximum runtime.
(After this time the job is killed)
-o|--output= <filename> store job output in file <filename>
(omitting this option, output is stored in slurm-%J.out, where %J is the jobid)
Slurm Options for Allocating GPUs
-G|--gpus=<n> requests <n> gpus of any kind
--gpus-per-node=<n> requests <n> gpus of any kind per node
Particular types of GPUs can be requested by replacing the <n> in the two options by
<type:n>

The available types on the GWDG Scientific Compute Cluster are currently:
gtx1080, gtx980, k40, v100, rtx5000

--cpus-per-gpu=<n> requests <n> cpus for every allocated gpu


Allocating Resources with salloc
for Submitting Jobs Interactively with srun
gwdu103 > salloc -p gpu -G 1
salloc: Pending job allocation 7409864
salloc: job 7409864 queued and waiting for resources
salloc: job 7409864 has been allocated resources
salloc: Granted job allocation 7409864
salloc: Waiting for resource configuration
salloc: Nodes dge002 are ready for job
bash-4.2$
The new bash-shell runs on the frontend node:
bash-4.2$ hostname
gwdu103
Start commands on allocated resources with srun
bash-4.2$ srun hostname
dge002
Enquiring Types of Allocated Resources
#include <stdio.h>
#include <unistd.h>

int main() {
char name[1024];
int nDevices;
cudaGetDeviceCount(&nDevices);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
gethostname(name, 1024);
printf("Node name:%s, gpus: %i, Device name:%s, Comp.Cap.:%i.%i\n", name,
nDevices, prop.name, prop.major,prop.minor);
return(0);
}

See ~ohaan/cuda_kurs/enquire_gpu.cu
Allocation Examples
gwdu103 > nvcc -o enquire.exe enquire_gpu.cu
gwdu103 > salloc -p gpu -N 1 -G rtx5000:3
...
bash-4.2$ srun ./enquire.exe
Node name:agq007, gpus: 3, Device name:Quadro RTX 5000, Comp.Cap.:7.5

gwdu103 > salloc -p gpu -t 100:00 -N 2 -G 3


...
bash-4.2$ srun ./enquire.exe
Node name:dge006, gpus: 1, Device name:NVIDIA GeForce GTX 1080, Comp.Cap.:6.1
Node name:dge005, gpus: 2, Device name:NVIDIA GeForce GTX 1080, Comp.Cap.:6.1
Submitting jobs with job scripts
> sbatch -p gpu -t 10:00 -G 1 --wrap="./saxpy"

or with a job script job.script:

> sbatch job.script

> cat job.script

#!/bin/bash
#SBATCH -p gpu
#SBATCH --reservation=gpu-course
#SBATCH -t 10:00
#SBATCH -G 1
./saxpy
see
~ohaan/cuda_kurs/job.script
CUDA Fortran: Adding two Vectors y = a*x+y

Host code for saxpy routine:

subroutine saxpy(N, a, x, y)
  implicit none
  integer :: N
  integer :: i
  real :: a
  real :: x(:), y(:)
  do i = 1 , N
    y(i) = a*x(i) + y(i)
  end do
end subroutine saxpy

Device code for saxpy_d kernel routine:

attributes(global) subroutine saxpy_d(N, a, x, y)
  implicit none
  integer, value :: N
  integer :: i
  real, value :: a
  real :: x(:), y(:)
  i = threadIdx%x
  if (i.le.N) then
    y(i) = a*x(i) + y(i)
  end if
end subroutine saxpy_d

The first two actual arguments in the calling sequence of the kernel have not been
declared in the host code to reside in device memory; therefore they have to be
passed by value, not by reference.
CUDA Fortran: Adding two Vectors

Host code for calling the sequential routine:

program call_saxpy
  use kernel
  implicit none
  integer, parameter :: N = 1024
  integer :: i
  real :: x(N), y(N)

  do i = 1 , N
    x(i) = 1.0
    y(i) = 2.0
  end do

  call saxpy(N, 2.0, x, y)

end program call_saxpy

Host code for calling the kernel routine:

program call_saxpy
  use kernel
  use cudafor
  implicit none
  integer, parameter :: N = 1024
  integer :: i
  real :: x(N), y(N)
  real, device :: x_d(N), y_d(N)

  do i = 1 , N
    x(i) = 1.0
    y(i) = 2.0
  end do
  x_d = x
  y_d = y

  call saxpy_d<<<1,N>>>(N, 2.0, x_d, y_d)

  y = y_d
end program call_saxpy

code in
~ohaan/cuda_kurs_f/saxpy.cuf
CUDA Fortran: using the kernel directive
program call_saxpy
use kernel
use cudafor
implicit none
integer, parameter :: N = 1024
integer :: i
real :: x(N), y(N)
real, device :: x_d(N), y_d(N)

do i = 1 , N
x(i) = 1.0
y(i) = 2.0
end do
x_d = x
y_d = y

!$cuf kernel do(1) <<< *, * >>>
do i = 1 , N
  y_d(i) = 2.0*x_d(i) + y_d(i)
end do
y = y_d
end program call_saxpy

Host code for calling kernel routine

code in
~ohaan/cuda_kurs_f/saxpy_dir.cuf
CUDA Fortran Unified Memory
program call_saxpy
use kernel
use cudafor
implicit none
integer, parameter :: N = 1024
integer :: i, istat
real, managed :: x(N), y(N)
real :: maxerr

do i = 1 , N
x(i) = 1.0
y(i) = 2.0
end do

call saxpy_d<<<1,N>>>(N, 2.0, x, y)


istat = cudaDeviceSynchronize()

maxerr=0.0
do i = 1 , N
maxerr = max(maxerr,abs(y(i)-4.0))
end do
write(6,*)' maxerr = ',maxerr

end program call_saxpy

code in
~ohaan/cuda_kurs_f/saxpy_um.cuf
Compiling CUDA Fortran codes
• CUDA Fortran source files must have the extension .cuf or .f90
• To be compiled and linked with the nvfortran compiler, which becomes available
by loading the NVIDIA HPC SDK module:
module load nvhpc
or sourcing the prepared command file
. module.x

• Compile CUDA Fortran source file call_saxpy.cuf with
nvfortran call_saxpy.cuf -o call_saxpy
• Compile CUDA Fortran source file call_saxpy.f90 with
nvfortran -cuda call_saxpy.f90 -o call_saxpy
Enquiring More Device Properties
• cudaGetDeviceCount(&nDevices);
Sets int nDevices to the number of devices available in the node
• cudaGetDeviceProperties(&prop, i);
Delivers in the members of the structure cudaDeviceProp prop
the values for various properties of device number i
Definition of cudaDeviceProp in the section CUDA Runtime API 6.5 of the CUDA Toolkit
Documentation

int main() {
int nDevices; cudaGetDeviceCount(&nDevices);
for (int i = 0; i < nDevices; i++) {
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, i);
printf("Device Number: %d\n", i);
printf(" Device name: %s\n", prop.name);
    ...
  }
}

complete code for enquiring in
~ohaan/cuda_kurs/device_properties.cu
Output from program device_properties.cu
(node with two GeForce GTX 980 devices; both devices report identical properties)

Device Number: 0 (and 1)
Device name: GeForce GTX 980
Device capability major revision number: 5
Device capability minor revision number: 2
Clock Rate (KHz): 1240500
total Global Memory (byte): 4294770688
Shared Memory per Block (byte): 49152
total Constant Memory (byte): 65536
size of L2 cache (byte): 2097152
32-bit Registers per Block: 65536
max. Threads per Block: 1024
number of Threads in Warp: 32
number of Multiprocessors: 16
Memory Clock Rate (KHz): 3505000
Max Grid Size: 2147483647 65535 65535
Max Block Size: 1024 1024 64
Memory Bus Width (bits): 256
Peak Memory Bandwidth (GB/s): 224.320000
CUDA Fortran: Enquiring Device Properties
integer :: i, istat, nDevices
type (cudaDeviceProp) :: prop
istat = cudaGetDeviceCount(nDevices)
do i = 0, nDevices-1
istat = cudaGetDeviceProperties(prop, i)
write(6,*) 'Device Number: ', i
write(6,*) 'Device name: ', prop%name
...
end do

complete code for enquiring in


~ohaan/cuda_kurs_f/device_properties.cuf
CUDA Fortran: pgaccelinfo
> pgaccelinfo

CUDA Driver Version: 9000


NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.111 Tue Dec 19 23:51:45 PST 2017

Device Number: 0
Device Name: GeForce GTX 1080
Device Revision Number:         6.1
Global Memory Size:             8508145664
Number of Multiprocessors:      20
Concurrent Copy and Execution:  Yes
Total Constant Memory:          65536
Total Shared Memory per Block:  49152
Registers per Block:            65536
Warp Size:                      32
Maximum Threads per Block:      1024
Maximum Block Dimensions:       1024, 1024, 64
Maximum Grid Dimensions:        2147483647 x 65535 x 65535
Maximum Memory Pitch:           2147483647B
Texture Alignment:              512B
Clock Rate:                     1733 MHz
Execution Timeout:              No
Integrated Device:              No
Can Map Host Memory:            Yes
Compute Mode:                   default
Concurrent Kernels:             Yes
ECC Enabled:                    No
Memory Clock Rate:              5005 MHz
Memory Bus Width:               256 bits
L2 Cache Size:                  2097152 bytes
Max Threads Per SMP:            2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc60
GPU-Properties of different nodes

GWDG nodes      GPUs per node       Chip (arch)       Comp.   Clock  Device  Band    SMXes  CUDA cores   CUDA cores   Perf. ratio
                                                      capab.  rate   memory  width          per SMX (SP) per SMX (DP) FP64/FP32
                                                              [MHz]  [GB]    [GB/s]
dte001-dte015   2 x Tesla K40       GK110 (Kepler)    3.5     745    12      288     15     192          64           1:3
dge008-dge015   4 x GeForce 980     GM204 (Maxwell)   5.2     1126   4       224     16     128          4            1:32
dge001-dge007   2 x GeForce 1080    GP104 (Pascal)    6.1     1733   8       320     20     128          4            1:32
agt001-agt002   8 x Tesla V100      GV100 (Volta)     7.0     1380   34      898     80     64           32           1:2
agq001-agq012   4 x Quadro rtx5000  TU104 (Turing)    7.5     1620   16      448     48     64           2            1:32
Selecting Different GPUs
• Compiling:
nvcc -arch=[sm_30|sm_35|sm_52|sm_61|sm_70|sm_75]
Set this flag according to compute capability of target GPU
Without setting this flag, nvcc compiles for compute capability 5.2

nvfortran -cuda -gpu=[cc35|cc50|cc60|cc70|cc75]


Without setting this flag, nvfortran compiles for compute capabilities of the gpus in
the compiling node, or for all compute capabilities, if no gpu is found.

• Selection of resources:
-G|--gpus=<type>:<n>
possible value for type :
gtx1080, gtx980, k40, v100, rtx5000
How to Use 2 GPUs simultaneously
• Device can be selected with cudaSetDevice(device_number)
• Prepare two executables: exe0 including cudaSetDevice(0)
exe1 including cudaSetDevice(1)

#!/bin/bash

#SBATCH -p gpu
#SBATCH -t 1:00
#SBATCH -N 1
#SBATCH --gpus-per-node=2      # selects a node with 2 GPUs

./exe0 > out0 &                # starts the two executables asynchronously
./exe1 > out1 &
wait                           # keeps the batch job alive until both have finished
CUDA Error Handling
• CUDA functions return an error code of type cudaError_t
    cudaError_t err = cudaMalloc(...)
• which can be translated into an error message by calling
    cudaGetErrorString(err)

• Errors in kernel functions can be enquired by
    kernel<<<grids,threads>>>(...);
    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
cudaCheckError()
from https://gist.github.com/jefflarkin/5390993

// Macro for checking cuda errors following a cuda launch or api call
#define cudaCheckError() { \
  cudaError_t e=cudaGetLastError(); \
  if(e!=cudaSuccess) { \
    printf("Cuda failure %s:%d: '%s'\n", \
           __FILE__,__LINE__,cudaGetErrorString(e)); } \
}

macro code in
~ohaan/cuda_kurs/errchk.ut
Exercise: Large Vectors
• Run saxpy.cu with
N > 1024

• Maximal 1024 threads in a single block:
saxpy_d<<<1,N>>>( N, a, x_d, y_d )
gives unpredictable results for N > 1024

Large Vectors with multiple Threadblocks
• Maximal 1024 threads in a single block:
saxpy_d<<<1,N>>>( N, a, x_d, y_d )
gives unpredictable results for N > 1024

• Use N_blks blocks:

- Modify host code
N_thrpb = 1024; N_blks = (N+N_thrpb-1)/N_thrpb;
saxpy_d<<<N_blks,N_thrpb>>>( N, a, x_d, y_d )

- Modify device code
int i = threadIdx.x + blockIdx.x*blockDim.x;
if (i < N) y[i] = a*x[i] + y[i];
Add Large Vectors with Error Checking
#include "errchk.ut"

int main(void) {
int N = 8024, N_thrpb, N_blks, i;
float *x, *y;

cudaMallocManaged( &x, N*sizeof(float));


cudaMallocManaged( &y, N*sizeof(float));

for (i=0; i<N; i++) {


x[i] = 1.0f;
y[i] = 2.0f;
}

N_thrpb =1024; N_blks = (N+N_thrpb-1)/N_thrpb;


saxpy_d<<<N_blks,N_thrpb>>>(N,2.0f,x, y);
cudaDeviceSynchronize(); cudaCheckError();

float maxError = 0.0f;


for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
cudaFree(x); cudaFree(y);
}

complete code in
~ohaan/cuda_kurs/saxpy_large.cu
3 dim Grids and Blocks
• can be configured with the CUDA type dim3:
dim3 gdims(gdim_x,gdim_y,gdim_z);
dim3 bdims(bdim_x,bdim_y,bdim_z);
kernel <<<gdims,bdims>>> (...);
• This will launch a total number of
gdim_x*gdim_y*gdim_z*bdim_x*bdim_y*bdim_z
threads on the device
• At most (number of SMXes)*2048 threads will be executing at any
time
Vector Addition with 3-dim Grids and Blocks

__global__ void saxpy_d( int N, int a, int *x, int *y )


{
int gridsize = gridDim.x * gridDim.y * gridDim.z;
int blocksize = blockDim.x * blockDim.y * blockDim.z;
int id_thr = threadIdx.x + threadIdx.y * blockDim.x
+ threadIdx.z * blockDim.x * blockDim.y;
int id_blk = blockIdx.x + blockIdx.y * gridDim.x
+ blockIdx.z * gridDim.x * gridDim.y;
int i = id_thr + blocksize * id_blk;
if (i < N) y[i] = x[i] + y[i];
}
2-dim Arrays & 2-dim Thread Block
int main(void){
int i, j, n = 4, m = 3;
float *x, *y;
size_t sizea = n*m*sizeof(float);

cudaMallocHost( &x, sizea);


cudaMallocHost( &y, sizea);

for(i=0; i<n; i++) {


for(j=0; j<m; j++) {
x[j+i*m] = 1.0f;
y[j+i*m] = 2.0f;
}
}

dim3 block(5,5);
saxpy_2d_d<<<1,block>>>(n, m, 2.0f, x, y);
cudaDeviceSynchronize(); cudaCheckError();

float maxError = 0.0f;


for (int i = 0; i < n*m; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
cudaFreeHost(x); cudaFreeHost(y);
}

Host code complete code in


~ohaan/cuda_kurs/saxpy_2d.cu
Adding 2-dim Arrays with 2-dim Thread Blocks
__global__
void saxpy_2d_d( int n, int m, float a, float *x, float *y ) {
  int i, j, index;
  j = threadIdx.x; i = threadIdx.y;
  if( i<n && j<m ) {
    index = i*m + j;
    y[index] = a*x[index] + y[index];
  }
}

device code

[Diagram: the n x m array (n rows, i = 0,...,n-1 = threadIdx.y; m columns,
j = 0,...,m-1 = threadIdx.x) is covered by a single 2-dim thread block with
blockDim.x columns and blockDim.y rows.
Array index of element (i,j):   threadIdx.x + threadIdx.y*m
Thread index within the block:  threadIdx.x + threadIdx.y*blockDim.x]
Large Two-dimensional Arrays
int main(void)
{
int i, j, n = 10000, m = 8000;
float *x, *y;
size_t sizea = n*m*sizeof(float);

cudaMallocHost( &x, sizea);


cudaMallocHost( &y, sizea);

for(i=0; i<n; i++) {


for(j=0; j<m; j++) {
x[j+i*m] = 1.0f;
y[j+i*m] = 2.0f;
}
}
int bdim_x = 32, bdim_y = 32;
int gdim_x=(m+bdim_x-1)/bdim_x, gdim_y=(n+bdim_y-1)/bdim_y;
dim3 blk(bdim_x,bdim_y), grd(gdim_x,gdim_y);
saxpy_2d_d<<<grd,blk>>>(n, m, 2.0f, x, y); cudaCheckError();
cudaDeviceSynchronize(); cudaCheckError();

float maxError = 0.0f;


for (int i = 0; i < n*m; i++) maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
printf( "n=%i, m=%i, grddim_x=%i, grddim_y=%i, blkdim_x=%i, blkdim_y=%i\n",
n,m,gdim_x,gdim_y,bdim_x,bdim_y );
cudaFreeHost(x); cudaFreeHost(y);
}
complete code in
Host code ~ohaan/cuda_kurs/saxpy_2d_large.cu
Adding Large Two-dimensional Arrays
__global__
void saxpy_2d_d( int n, int m, float a, float *x, float *y )
{
int i, j, index;
j = blockIdx.x*blockDim.x+threadIdx.x;
i = blockIdx.y*blockDim.y+threadIdx.y;
if( i<n && j<m ) {
index = i*m + j;
y[index] = a*x[index] + y[index] ;
}
}

device code

[Diagram: the n x m array (n rows, i = 0,...,n-1; m columns, j = 0,...,m-1) is covered
by a 2-dim grid of 2-dim blocks.
2-dim array index:  (i,j)
2-dim thread index: (blockIdx.x*blockDim.x+threadIdx.x, blockIdx.y*blockDim.y+threadIdx.y)]
Timing of CUDA Codes
• Read internal clock before and after a code segment in host code
• Since kernel calls from host are asynchronous, host and device must
be synchronized by cudaDeviceSynchronize() before
calling the internal clock

double tstart = int_clock();
...
kernel<<<grids,threads>>>(...);
cudaDeviceSynchronize();
double tend = int_clock();
printf( "elapsed time : %lf \n", tend-tstart );
Internal Clock for Elapsed Time
C-function gettimeofday returns elapsed time
with microsec precision.

#include <sys/time.h>
double get_el_time(){
  struct timeval et;
  gettimeofday( &et, NULL );
  return (double)et.tv_sec
        + 1.e-6*(double)et.tv_usec;
}

code for get_el_time in
~ohaan/cuda_kurs/time.ut
Timing of CUDA Code with CUDA Events
• Read internal clock before and after a code segment in host code

cudaEvent_t start, stop;


cudaEventCreate(&start);cudaEventCreate(&stop);
cudaEventRecord( start, 0 );
...
kernel<<<grids,threads>>>(...);
...
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
float et; cudaEventElapsedTime( &et,start, stop );
cudaEventDestroy( start );cudaEventDestroy( stop );
printf( "cpu time on device : %3.1f millisec \n", et );
Measuring Bandwidth for Adding Arrays
Array size: 10 000 x 10 000
Using unified memory with cudaMallocManaged

Measured bandwidth [GB/s] for different two-dim block sizes:

GPU type   Nominal bandwidth   (1024,1)   (32,32)   (1,1024)
gtx1080    320                 241        239        32
rtx5000    448                 380        378        65
v100       898                 754        728       135

code for bandwidth measurement in
~ohaan/cuda_kurs/saxpy_bw.cu
Memory Organization, Hardware View

[Diagram: the host consists of CPUs with caches attached to the host main memory; the
graphics device consists of several streaming multiprocessors (SMX 1 ... SMX n), an L2
cache and the device main memory. Each SMX contains a control unit for scheduling and
dispatching, a file of shared 32-bit registers, shared memory / L1 cache, and a
constant + texture cache.]
Device Memory, Software View

[Diagram: each thread has its own local memory; the threads of a block share the block's
shared memory; all blocks of a grid have access to the global, constant and texture
memory of the device.]
Types of Kernel Variables: Local
• Variables (scalars and arrays) defined in the scope of a kernel are local
__global__ void ker1(int szaloc2,..){
int iloc1, iloc2;
float aloc1[6], *aloc2;
aloc2 = (float *)malloc(sizeof(float)*szaloc2);
...
}
• Each thread has its own set of local variables, which are placed in the
register files of the SMXes, or in global memory, if there are not enough
registers or if the variable is an indexed array
• Number of 32-bit registers per SMX: 2^16 = 65536
• Maximal number of registers per thread: 255
• For the maximal number of 2048 threads per SMX, only 65536/2048 = 32 registers
  per thread are available
Types of Kernel Variables: Global
• global variables (scalars and arrays) defined in the scope of the
application, reside in device main memory and are shared by all threads
in the kernels
• If allocated from host by calling cudaMalloc()
they can be accessed from host by cudaMemcpy(...)
• If allocated from host by calling cudaMallocManaged()
they can be accessed from host and from device with the same address
(unified memory)
• accessing the same global variable from a kernel by different threads is
not deterministic, since the order of execution for different blocks of
threads is not prescribed
Accessing Global Variables
• Device memory is accessed by load/store operations for aligned
memory segments of size 32, 64, or 128 Bytes
• If the 32 threads of a warp access 32 int or float variables lying
consecutively in memory, 4 load/store operations of 32 Byte segments
serve all 32 accesses (coalescent access)
• Compare the performance of 2-dim array addition on gtx1080:
  blockDim.x = 32, blockDim.y = 32:    239 GB/s
  blockDim.x = 1,  blockDim.y = 1024:   32 GB/s
  Nominal bandwidth: 320 GB/s
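
The difference can be reproduced with two simple copy kernels. The following is a minimal
sketch (the kernel names copy_coalesced and copy_strided are illustrative, not taken from
the course codes):

// Coalescent access: consecutive threads of a warp load/store consecutive
// addresses, so each warp is served by a few 32-byte segment transactions.
__global__ void copy_coalesced( int n, const float *src, float *dst )
{
  int i = threadIdx.x + blockIdx.x*blockDim.x;
  if (i < n) dst[i] = src[i];
}

// Strided access: consecutive threads touch addresses 'stride' elements apart,
// so nearly every access of a warp falls into a different memory segment.
__global__ void copy_strided( int n, int stride, const float *src, float *dst )
{
  int i = threadIdx.x + blockIdx.x*blockDim.x;
  long long idx = (long long)i * stride;
  if (idx < n) dst[idx] = src[idx];
}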
Copy of Data in Global Memory

[Plot: measured bandwidth [GB/s] versus the number of float data copied, for different
block sizes. System: NVIDIA Quadro RTX 5000, nominal bandwidth 448 GB/s.]

Copy of Data in Global Memory

[Plot: measured bandwidth [GB/s] versus the number of float data copied, for different
block sizes. System: NVIDIA Tesla V100, nominal bandwidth 898 GB/s.]
Types of Kernel Variables: Constant
• constant memory variables (scalars and arrays), defined in the scope of the
application, are read only, reside in device main memory, are cached in the
constant cache of each SMX, and are shared by all threads in the kernels.
• Allocated on the device with the __device__ __constant__ qualifier
__device__ __constant__ int sconst;
__device__ __constant__ float aconst[1024];
• Can be initialized from host with
cudaMemcpyToSymbol(aconst, a_h, 1024*sizeof(float));
• If all threads in a kernel read the same data, the use of constant memory
variables reduces the accesses to device memory by employing the SMX's
8 kB sized constant caches.
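
A minimal sketch of how constant memory could be used (the names coef and poly_d are
illustrative, not from the course codes): every thread evaluates a polynomial with the
same coefficients, so the coefficient reads are served from the constant cache.

#include <stdio.h>

__device__ __constant__ float coef[4];         // read-only, cached per SMX

__global__ void poly_d( int n, float *x, float *y )
{
  int i = threadIdx.x + blockIdx.x*blockDim.x;
  if (i < n)                                   // all threads read the same coefficients
    y[i] = coef[0] + x[i]*(coef[1] + x[i]*(coef[2] + x[i]*coef[3]));
}

int main(void) {
  const int N = 1024;
  float coef_h[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float *x, *y;
  cudaMallocManaged( &x, N*sizeof(float) );
  cudaMallocManaged( &y, N*sizeof(float) );
  for (int i = 0; i < N; i++) x[i] = 1.0f;

  cudaMemcpyToSymbol( coef, coef_h, 4*sizeof(float) );   // initialize constant memory from host
  poly_d<<<1,N>>>( N, x, y );
  cudaDeviceSynchronize();
  printf("y[0] = %f\n", y[0]);                 // expected: 10.0
  cudaFree(x); cudaFree(y);
}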
Types of Kernel Variables: Shared
• shared variables (scalars and arrays) are defined in the scope of a block of threads
of a single kernel function and reside in the shared memory of the SMX executing
the block of threads
• All threads of a block have access to a block‘s shared variables
• Threads of other blocks cannot access a block‘s shared variables

• Static allocation of one or more shared arrays in a kernel function


__global__ void ker1(...){
__shared__ float sh_float[64]; __shared__ int sh_int[64];
...
}
• Dynamic allocation of shared memory (in one single shared array):
declaration outside kernel
extern __shared__ float sh_array[];
allocation in host code via extended execution configuration
size_t N_sh_bytes = 64*sizeof(float);
ker1<<<grid,block,N_sh_bytes>>>(…);
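
As an illustration of the dynamic variant, the following sketch (the kernel name reverse_d
is hypothetical) reverses the elements handled by one block through the dynamically
allocated shared array; the barrier __syncthreads(), introduced on the following slides,
is needed before reading what other threads have written:

extern __shared__ float sh_array[];            // size set by the third launch parameter

__global__ void reverse_d( int n, float *x )
{
  int i = threadIdx.x;
  if (i < n) sh_array[i] = x[i];
  __syncthreads();                             // all stores to shared memory completed
  if (i < n) x[i] = sh_array[n-1-i];
}

// host code: pass the shared-memory size in the execution configuration
//   int n = 256;
//   reverse_d<<<1, n, n*sizeof(float)>>>( n, x );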
Device Synchronization from Host
• Synchronous calls: cudaMalloc, cudaMemcpy, cudaMallocManaged

• Asynchronous calls: kernel<<<...>>>(...),
  cudaMemcpyAsync,
  cudaMemPrefetchAsync,
  ...

• A call in a host program to


cudaDeviceSynchronize();
will synchronize all previously started activities of the device
Thread Synchronization from Device
• In a device function, threads within a block can be synchronized by calling
the barrier
__syncthreads();
• Waits until all threads in a block have reached this instruction and all
accesses to global and shared memory from these threads are completed
• Danger of stalled execution:
if (i < cut) __syncthreads();
will hang if, within a block, some threads have i < cut and others have i >= cut
• Is used to coordinate memory access from threads within a single block
• __syncthreads() cannot coordinate the execution of threads from
different blocks
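
A typical use of __syncthreads() is a block-wise reduction in shared memory. The sketch
below (the kernel name block_sum_d is illustrative and assumes blockDim.x is a power of
two, at most 1024) produces one partial sum per block, which the host or a second kernel
can then combine:

__global__ void block_sum_d( int n, const float *b, float *partial )
{
  __shared__ float cache[1024];                // one element per thread of the block
  int tid = threadIdx.x;
  int i   = threadIdx.x + blockIdx.x*blockDim.x;

  cache[tid] = (i < n) ? b[i] : 0.0f;
  __syncthreads();                             // all loads into shared memory completed

  for (int s = blockDim.x/2; s > 0; s /= 2) {  // tree reduction within the block
    if (tid < s) cache[tid] += cache[tid + s];
    __syncthreads();                           // barrier reached by all threads of the block
  }
  if (tid == 0) partial[blockIdx.x] = cache[0];
}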
Atomic Operations
Example: accumulate the content of array b into memory
location a
• Sequential on host:   for (i=0; i<n; i++) a = a + b[i];
• Parallel in a kernel: if (i<n) a = a + b[i];

If several threads modify the content of the same address, the
result depends on the temporal order of their operations.
Atomic Operations
Possible interleaving 1: both threads read a before either one writes

Thread 0                           Thread 1
Read r1 from a    (r1 = 0)         Read r1 from a    (r1 = 0)
r2 = r1 + b[0]    (r2 = b[0])      r2 = r1 + b[1]    (r2 = b[1])
write r2 to a     (a = b[0])
                                   write r2 to a     (a = b[1])

Possible interleaving 2: Thread 1 reads a after Thread 0 has written

Thread 0                           Thread 1
Read r1 from a    (r1 = 0)
r2 = r1 + b[0]    (r2 = b[0])
write r2 to a     (a = b[0])
                                   Read r1 from a    (r1 = b[0])
                                   r2 = r1 + b[1]    (r2 = b[0]+b[1])
                                   write r2 to a     (a = b[0]+b[1])
Atomic Operations
atomicAdd : atomic for all CUDA threads in the current program
executing in the same compute device as the current thread.
atomicAdd_block : atomic for all CUDA threads in the current
program executing in the same thread block as the current thread
atomicAdd_system : atomic for all threads in the current program
including other CPUs and GPUs in the system

GPUs with compute capability < 6.0
  - only support the device-wide atomic functions
  - do not support atomicAdd for double precision numbers
    (a workaround based on atomicCAS is sketched at the end of this section)
Atomic Operations
An atomic function performs a read-modify-write atomic operation on
one 32-bit or 64-bit word residing in global or shared memory. The
operation is atomic in the sense that it is guaranteed to be performed
without interference from other threads
Atomic add:
int atomicAdd(int* value, int incr);
CUDA Fortran:
atomicadd( value, incr )
type of value/incr can be integer(4), integer(8), real(4), real(8)
Many more atomic operations are supported:
cf. CUDA Toolkit Programming Guide B12
CUDA FORTRAN PROGRAMMING GUIDE AND REFERENCE 3.6.6
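
As a minimal illustration (the kernel name accumulate_d is not from the course codes),
the accumulation from the slides above becomes deterministic when the update of a is
done with atomicAdd:

__global__ void accumulate_d( int n, const float *b, float *a )
{
  int i = threadIdx.x + blockIdx.x*blockDim.x;
  if (i < n) atomicAdd( a, b[i] );             // *a = *a + b[i] without interference
}

// after the launch and cudaDeviceSynchronize(), *a holds the sum of
// b[0],...,b[n-1], independent of the temporal order of the threads

For devices with compute capability < 6.0, which lack a double precision atomicAdd, the
CUDA Toolkit Programming Guide describes how it can be emulated with atomicCAS on the
64-bit word; a sketch of that workaround (the name atomicAdd_dbl is chosen here to avoid
a clash with the built-in atomicAdd on newer devices):

__device__ double atomicAdd_dbl( double *address, double val )
{
  unsigned long long int *address_as_ull = (unsigned long long int *)address;
  unsigned long long int old = *address_as_ull, assumed;
  do {
    assumed = old;                             // value expected to be found at *address
    old = atomicCAS( address_as_ull, assumed,
                     __double_as_longlong( val + __longlong_as_double(assumed) ) );
  } while (assumed != old);                    // retry if another thread modified *address
  return __longlong_as_double(old);
}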
