
Using CUDA

Oswald Haan
ohaan@gwdg.de
Code Examples:

cp -r ~ohaan/cuda_kurs/* .
cp -r ~ohaan/cuda_kurs_f/* .
A first Example: Adding two Vectors y = a*x+y

Host code for saxpy routine:

void saxpy( int n, float a, float *x, float *y )
{
  int i;
  for (i=0; i<n; i++) {
    y[i] = a*x[i] + y[i];
  }
}

Device code for saxpy_d kernel routine:

__global__ void saxpy_d( int n, float a, float *x, float *y )
{
  int i = threadIdx.x;
  if (i<n) {
    y[i] = a*x[i] + y[i];
  }
}
A first Example: Adding two Vectors

Host code for calling the sequential routine:

int main( void ) {
  int N = 1024, i;
  float *x, *y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  for(i=0; i<N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  saxpy( N, 2.0f, x, y );

  free(x); free(y);
}

Host code for calling the kernel routine:

int main( void ) {
  int N = 1024, i;
  float *x, *y, *x_d, *y_d;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));
  cudaMalloc( &x_d, N*sizeof(float));
  cudaMalloc( &y_d, N*sizeof(float));

  for(i=0; i<N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(x_d, x, sizeof(float)*N, cudaMemcpyHostToDevice);
  cudaMemcpy(y_d, y, sizeof(float)*N, cudaMemcpyHostToDevice);

  saxpy_d<<<1,N>>>( N, 2.0f, x_d, y_d );

  cudaMemcpy(y, y_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

  cudaFree(x_d); cudaFree(y_d);
  free(x); free(y);
}

code in
~ohaan/cuda_kurs/saxpy.cu
Managing Memory on Host and on Device
a = (float*)malloc(3*sizeof(float));
Allocates memory for three floats at address a in host memory
cudaMalloc( &a_d, 3*sizeof(float) );
Allocates memory for three floats at address a_d in device memory
Stores this address at address &a_d in host memory
[Diagram: the host array occupies cells a[0], a[1], a[2] at addresses a, a+1, a+2 in host
memory; the device array occupies cells a_d[0], a_d[1], a_d[2] at addresses a_d, a_d+1,
a_d+2 in device memory; the value a_d (the address of the device array) is itself stored
at address &a_d in host memory.]
cudaMemcpy(a_d, a, 3*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(a, a_d, 3*sizeof(float), cudaMemcpyDeviceToHost);

Arguments: destination address, source address, size of data to be copied


Using Unified Memory (compute capability >= 3.0, CUDA version >= 6.0)
int main(void) {
  int N = 1024, i;
  float *x, *y;

  cudaMallocManaged( &x, sizeof(float)*N );   // allocates memory accessible from
  cudaMallocManaged( &y, sizeof(float)*N );   // host and from device (unified memory)

  for (i=0; i<N; i++) {
    x[i] = 1.0f;                              // initializes data in host memory
    y[i] = 2.0f;
  }

  saxpy_d<<<1,N>>>( N, 2.0f, x, y );
  cudaDeviceSynchronize();                    // synchronization is necessary, because the host
                                              // must not access unified memory until the device
                                              // is inactive

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f)); // reads data written by the device
  printf("Max error: %f\n", maxError);
  cudaFree(x); cudaFree(y);
}

Host code
code in
~ohaan/cuda_kurs/saxpy_um.cu
Unified Memory with Static Allocation
const int N=1024;
__device__ __managed__ float x[N], y[N];

__global__
void saxpy_d( float a ) {
y[threadIdx.x] = a*x[threadIdx.x] + y[threadIdx.x];
}

int main(void) {
int i;

for (i=0; i<N; i++) {


x[i] = 1.0f;
y[i] = 2.0f;
}

saxpy_d<<<1,N>>>(2.0f);
cudaDeviceSynchronize();

float maxError = 0.0f;


for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
}

code in
~ohaan/cuda_kurs/saxpy_um_static.cu
Compiling CUDA codes
• CUDA source files must have the extension .cu
• Compiler nvcc is provided in the CUDA toolkit
• CUDA toolkit is available on GWDG's cluster frontends
gwdu101, gwdu102
by loading the CUDA toolkit module:
module load cuda
or sourcing the prepared command file
. module.x
• Compiling CUDA source file saxpy.cu with
nvcc -arch=sm_<xx> saxpy.cu -o saxpy
produces executable saxpy,
where xx is the compute capability of the GPU to be used;
xx = 52 is the default if the option -arch is not set.
With -arch=all, code for all possible compute capabilities will be generated.
Execution environment for CUDA executables
• GWDG’s compute cluster is operated by the workload manager Slurm, which
provides commands for allocating resources, for submitting jobs and for enquiring the
status of the cluster and of jobs
• All nodes with GPU devices belong to the Slurm partition gpu
• The Slurm sinfo command provides a list of nodes with types and numbers of gpus:
> sinfo -p gpu --format=%N,%G
NODELIST,GRES
dge[008-015],gpu:gtx980:4
dge[001-007],gpu:gtx1080:2
dte[001-010],gpu:k40:2
agt[001-002],gpu:v100:8
agq[001-012],gpu:rtx5000:4
General Slurm options
-p|--partition=gpu allocates resources providing gpus
--reservation=gpu-course allocates resources reserved for this course
-N,--nodes=<n> allocates <n> nodes
-n|--ntasks=<n> allocates resources for starting <n> tasks
--tasks-per-node=<n> allocates resources for starting <n> tasks per node.
(If used with -n , it denotes the maximum number of tasks per node)
-t|--time= <hh:mm:ss> maximum runtime.
(After this time the job is killed)
-o|--output= <filename> store job output in file <filename>
(omitting this option, output is stored in slurm-%J.out, where %J is the jobid)
Slurm Options for Allocating GPUs
-G|--gpus=<n> requests <n> gpus of any kind
--gpus-per-node=<n> requests <n> gpus of any kind per node
Particular types of GPUs can be requested by replacing the <n> in the two options by
<type:n>

The available types on the GWDG Scientific Compute Cluster are currently:
gtx1080, gtx980, k40, v100, rtx5000

--cpus-per-gpu=<n> requests <n> cpus for every allocated gpu


Allocating Resources with salloc
for Submitting Jobs Interactively with srun
gwdu103 > salloc -p gpu -G 1
salloc: Pending job allocation 7409864
salloc: job 7409864 queued and waiting for resources
salloc: job 7409864 has been allocated resources
salloc: Granted job allocation 7409864
salloc: Waiting for resource configuration
salloc: Nodes dge002 are ready for job
bash-4.2$
The new bash-shell runs on the frontend node:
bash-4.2$ hostname
gwdu103
Start commands on allocated resources with srun
bash-4.2$ srun hostname
dge002
Enquiring Types of Allocated Resources
#include <stdio.h>
#include <unistd.h>

int main() {
char name[1024];
int nDevices;
cudaGetDeviceCount(&nDevices);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
gethostname(name, 1024);
printf("Node name:%s, gpus: %i, Device name:%s, Comp.Cap.:%i.%i\n", name,
nDevices, prop.name, prop.major,prop.minor);
return(0);
}

See ~ohaan/cuda_kurs/enquire_gpu.cu
Allocation Examples
gwdu103 > nvcc -o enquire.exe enquire_gpu.cu
gwdu103 > salloc -p gpu -N 1 -G rtx5000:3
...
bash-4.2$ srun ./enquire.exe
Node name:agq007, gpus: 3, Device name:Quadro RTX 5000, Comp.Cap.:7.5

gwdu103 > salloc -p gpu -t 100:00 -N 2 -G 3


...
bash-4.2$ srun ./enquire.exe
Node name:dge006, gpus: 1, Device name:NVIDIA GeForce GTX 1080, Comp.Cap.:6.1
Node name:dge005, gpus: 2, Device name:NVIDIA GeForce GTX 1080, Comp.Cap.:6.1
Submitting jobs with job scripts
> sbatch -p gpu -t 10:00 -G 1 --wrap="./saxpy"

or with a job script job.script:

> sbatch job.script

> cat job.script

#!/bin/bash
#SBATCH -p gpu
#SBATCH --reservation=gpu-course
#SBATCH -t 10:00
#SBATCH -G 1
./saxpy
see
~ohaan/cuda_kurs/job.script
CUDA Fortran: Adding two Vectors y = a*x+y

Host code for saxpy routine:

subroutine saxpy(N, a, x, y)
  implicit none
  integer :: N
  integer :: i
  real :: a
  real :: x(:), y(:)
  do i = 1 , N
    y(i) = a*x(i) + y(i)
  end do
end subroutine saxpy

Device code for saxpy_d kernel routine:

attributes(global) subroutine saxpy_d(N, a, x, y)
  implicit none
  integer, value :: N
  integer :: i
  real, value :: a
  real :: x(:), y(:)
  i = threadIdx%x
  if (i.le.N) then
    y(i) = a*x(i) + y(i)
  end if
end subroutine saxpy_d

The first two actual arguments in the calling sequence of the kernel have not been
declared in the host code to reside in device memory; therefore they have to be
passed by value, not by reference.
CUDA Fortran: Adding two Vectors

Host code for calling the sequential routine:

program call_saxpy
  use kernel
  implicit none
  integer, parameter :: N = 1024
  integer :: i
  real :: x(N), y(N)

  do i = 1 , N
    x(i) = 1.0
    y(i) = 2.0
  end do

  call saxpy(N, 2.0, x, y)

end program call_saxpy

Host code for calling the kernel routine:

program call_saxpy
  use kernel
  use cudafor
  implicit none
  integer, parameter :: N = 1024
  integer :: i
  real :: x(N), y(N)
  real, device :: x_d(N), y_d(N)

  do i = 1 , N
    x(i) = 1.0
    y(i) = 2.0
  end do
  x_d = x
  y_d = y

  call saxpy_d<<<1,N>>>(N, 2.0, x_d, y_d)

  y = y_d
end program call_saxpy

code in
~ohaan/cuda_kurs_f/saxpy.cuf
CUDA Fortran: using the kernel directive
program call_saxpy
use kernel
use cudafor
implicit none
integer, parameter :: N = 1024
integer :: i
real :: x(N), y(N)
real, device :: x_d(N), y_d(N)

do i = 1 , N
x(i) = 1.0
y(i) = 2.0
end do
x_d = x
y_d = y

!$cuf kernel do(1) <<< *, * >>>
do i = 1 , N
  y_d(i) = 2.0*x_d(i) + y_d(i)
end do
y = y_d
end program call_saxpy

Host code for calling kernel routine

code in
~ohaan/cuda_kurs_f/saxpy_dir.cuf
CUDA Fortran Unified Memory
program call_saxpy
use kernel
use cudafor
implicit none
integer, parameter :: N = 1024
integer :: i, istat
real, managed :: x(N), y(N)
real :: maxerr

do i = 1 , N
x(i) = 1.0
y(i) = 2.0
end do

call saxpy_d<<<1,N>>>(N, 2.0, x, y)


istat = cudaDeviceSynchronize()

maxerr=0.0
do i = 1 , N
maxerr = max(maxerr,abs(y(i)-4.0))
end do
write(6,*)' maxerr = ',maxerr

end program call_saxpy

code in
~ohaan/cuda_kurs_f/saxpy_um.cuf
Compiling CUDA Fortran codes
• CUDA Fortran source files must have the extension .cuf or .f90
• To be compiled and linked with the nvfortran compiler, which becomes available
by loading the NVIDIA HPC SDK module:
module load nvhpc
or sourcing the prepared command file
. module.x

• Compile CUDA Fortran source file call_saxpy.cuf with
nvfortran call_saxpy.cuf -o call_saxpy
• Compile CUDA Fortran source file call_saxpy.f90 with
nvfortran -cuda call_saxpy.f90 -o call_saxpy
Enquiring More Device Properties
• cudaGetDeviceCount(&nDevices);
Sets int nDevices to the number of devices available in the node
• cudaGetDeviceProperties(&prop, i);
Delivers in the members of the structure cudaDeviceProp prop
the values for various properties of device number i
Definition of cudaDeviceProp in the section CUDA Runtime API 6.5 of the CUDA Toolkit
Documentation

int main() {
int nDevices; cudaGetDeviceCount(&nDevices);
for (int i = 0; i < nDevices; i++) {
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, i);
printf("Device Number: %d\n", i);
printf(" Device name: %s\n", prop.name);
    ...
  }
}

complete code for enquiring in
~ohaan/cuda_kurs/device_properties.cu
Output from program device_properties.cu
(node with two GeForce GTX 980 devices; both devices report identical properties)

Device Number: 0 (and 1)
Device name: GeForce GTX 980
Device capability major revision number: 5
Device capability minor revision number: 2
Clock Rate (KHz): 1240500
total Global Memory (byte): 4294770688
Shared Memory per Block (byte): 49152
total Constant Memory (byte): 65536
size of L2 cache (byte): 2097152
32-bit Registers per Block: 65536
max. Threads per Block: 1024
number of Threads in Warp: 32
number of Multiprocessors: 16
Memory Clock Rate (KHz): 3505000
Max Grid Size: 2147483647 65535 65535
Max Block Size: 1024 1024 64
Memory Bus Width (bits): 256
Peak Memory Bandwidth (GB/s): 224.320000
CUDA Fortran: Enquiring Device Properties
integer :: i, istat, nDevices
type (cudaDeviceProp) :: prop
istat = cudaGetDeviceCount(nDevices)
do i = 0, nDevices-1
istat = cudaGetDeviceProperties(prop, i)
write(6,*) 'Device Number: ', i
write(6,*) 'Device name: ', prop%name
...
end do

complete code for enquiring in


~ohaan/cuda_kurs_f/device_properties.cuf
CUDA Fortran: pgaccelinfo
> pgaccelinfo

CUDA Driver Version: 9000


NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.111 Tue Dec 19 23:51:45 PST 2017

Device Number: 0
Device Name: GeForce GTX 1080
Device Revision Number:         6.1
Global Memory Size:             8508145664
Number of Multiprocessors:      20
Concurrent Copy and Execution:  Yes
Total Constant Memory:          65536
Total Shared Memory per Block:  49152
Registers per Block:            65536
Warp Size:                      32
Maximum Threads per Block:      1024
Maximum Block Dimensions:       1024, 1024, 64
Maximum Grid Dimensions:        2147483647 x 65535 x 65535
Maximum Memory Pitch:           2147483647B
Texture Alignment:              512B
Clock Rate:                     1733 MHz
Execution Timeout:              No
Integrated Device:              No
Can Map Host Memory:            Yes
Compute Mode:                   default
Concurrent Kernels:             Yes
ECC Enabled:                    No
Memory Clock Rate:              5005 MHz
Memory Bus Width:               256 bits
L2 Cache Size:                  2097152 bytes
Max Threads Per SMP:            2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc60
GPU-Properties of different nodes

GWDG nodes      GPUs per node       Chip (arch)       Comp.   Clock  Device  Band    SMXes  CUDA cores   CUDA cores   Perf. ratio
                                                      capab.  rate   memory  width          per SMX (SP) per SMX (DP) FP64/FP32
                                                              [MHz]  [GB]    [GB/s]
dte001-dte015   2 x Tesla K40       GK110 (Kepler)    3.5     745    12      288     15     192          64           1:3
dge008-dge015   4 x GeForce 980     GM204 (Maxwell)   5.2     1126   4       224     16     128          4            1:32
dge001-dge007   2 x GeForce 1080    GP104 (Pascal)    6.1     1733   8       320     20     128          4            1:32
agt001-agt002   8 x Tesla V100      GV100 (Volta)     7.0     1380   34      898     80     64           32           1:2
agq001-agq012   4 x Quadro rtx5000  TU104 (Turing)    7.5     1620   16      448     48     64           2            1:32
Selecting Different GPUs
• Compiling:
nvcc -arch=[sm_30|sm_35|sm_52|sm_61|sm_70|sm_75]
Set this flag according to compute capability of target GPU
Without setting this flag, nvcc compiles for compute capability 5.2

nvfortran -cuda -gpu=[cc35|cc50|cc60|cc70|cc75]


Without setting this flag, nvfortran compiles for compute capabilities of the gpus in
the compiling node, or for all compute capabilities, if no gpu is found.

• Selection of resources:
-G|--gpus=<type>:<n>
possible value for type :
gtx1080, gtx980, k40, v100, rtx5000
How to Use 2 GPUs simultaneously
• Device can be selected with cudaSetDevice(device_number)
• Prepare two executables: exe0 including cudaSetDevice(0)
exe1 including cudaSetDevice(1)

#!/bin/bash

#SBATCH -p gpu
#SBATCH -t 1:00
#SBATCH -N 1
#SBATCH --gpus-per-node=2      # selects a node with 2 GPUs

./exe0 > out0 &                # starts the two executables asynchronously
./exe1 > out1 &
wait                           # keeps the batch job alive until both have finished
CUDA Error Handling
• CUDA functions return an error code of type cudaError_t
    cudaError_t err = cudaMalloc(...)
• which can be translated into an error message by calling
    cudaGetErrorString(err)

• Errors in kernel functions can be enquired by
    kernel<<<grids,threads>>>(...);
    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
cudaCheckError()
from https://gist.github.com/jefflarkin/5390993

// Macro for checking cuda errors following a cuda launch or api call
#define cudaCheckError() { \
  cudaError_t e=cudaGetLastError(); \
  if(e!=cudaSuccess) { \
    printf("Cuda failure %s:%d: '%s'\n", \
           __FILE__,__LINE__,cudaGetErrorString(e)); } \
}

macro code in
~ohaan/cuda_kurs/errchk.ut
Exercise: Large Vectors
• Run saxpy.cu with
N > 1024

• Maximal 1024 threads in a single block:
saxpy_d<<<1,N>>>( N, a, x_d, y_d )
gives unpredictable results for N > 1024

Large Vectors with multiple Threadblocks
• Maximal 1024 threads in a single block:
saxpy_d<<<1,N>>>( N, a, x_d, y_d )
gives unpredictable results for N > 1024

• Use N_blks blocks:

- Modify host code
N_thrpb = 1024; N_blks = (N+N_thrpb-1)/N_thrpb;
saxpy_d<<<N_blks,N_thrpb>>>( N, a, x_d, y_d )

- Modify device code
int i = threadIdx.x + blockIdx.x*blockDim.x;
if (i < N) y[i] = a*x[i] + y[i];
Add Large Vectors with Error Checking
#include "errchk.ut"

int main(void) {
int N = 8024, N_thrpb, N_blks, i;
float *x, *y;

cudaMallocManaged( &x, N*sizeof(float));


cudaMallocManaged( &y, N*sizeof(float));

for (i=0; i<N; i++) {


x[i] = 1.0f;
y[i] = 2.0f;
}

N_thrpb =1024; N_blks = (N+N_thrpb-1)/N_thrpb;


saxpy_d<<<N_blks,N_thrpb>>>(N,2.0f,x, y);
cudaDeviceSynchronize(); cudaCheckError();

float maxError = 0.0f;


for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
cudaFree(x); cudaFree(y);
}

complete code in
~ohaan/cuda_kurs/saxpy_large.cu
3 dim Grids and Blocks
• can be configured with the CUDA type dim3:
dim3 gdims(gdim_x,gdim_y,gdim_z);
dim3 bdims(bdim_x,bdim_y,bdim_z);
kernel <<<gdims,bdims>>> (...);
• This will launch a total number of
gdim_x*gdim_y*gdim_z*bdim_x*bdim_y*bdim_z
threads on the device
• At most (number of SMXes)*2048 threads will be executing at any
time
Vector Addition with 3-dim Grids and Blocks

__global__ void saxpy_d( int N, int a, int *x, int *y )


{
int gridsize = gridDim.x * gridDim.y * gridDim.z;
int blocksize = blockDim.x * blockDim.y * blockDim.z;
int id_thr = threadIdx.x + threadIdx.y * blockDim.x
+ threadIdx.z * blockDim.x * blockDim.y;
int id_blk = blockIdx.x + blockIdx.y * gridDim.x
+ blockIdx.z * gridDim.x * gridDim.y;
int i = id_thr + blocksize * id_blk;
if (i < N) y[i] = x[i] + y[i];
}
2-dim Arrays & 2-dim Thread Block
int main(void){
int i, j, n = 4, m = 3;
float *x, *y;
size_t sizea = n*m*sizeof(float);

cudaMallocHost( &x, sizea);


cudaMallocHost( &y, sizea);

for(i=0; i<n; i++) {


for(j=0; j<m; j++) {
x[j+i*m] = 1.0f;
y[j+i*m] = 2.0f;
}
}

dim3 block(5,5);
saxpy_2d_d<<<1,block>>>(n, m, 2.0f, x, y);
cudaDeviceSynchronize(); cudaCheckError();

float maxError = 0.0f;


for (int i = 0; i < n*m; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
cudaFreeHost(x); cudaFreeHost(y);
}

Host code complete code in


~ohaan/cuda_kurs/saxpy_2d.cu
Adding 2-dim Arrays with 2-dim Thread Blocks
__global__
void saxpy_2d_d( int n, int m, float a, float *x, float *y ) {
  int i, j, index;
  j = threadIdx.x; i = threadIdx.y;
  if( i<n && j<m ) {
    index = i*m + j;
    y[index] = a*x[index] + y[index];
  }
}

device code

[Diagram: the n x m array (n rows, i = 0,...,n-1 = threadIdx.y; m columns,
j = 0,...,m-1 = threadIdx.x) is covered by a single 2-dim thread block with
blockDim.x columns and blockDim.y rows.
Array index of element (i,j):   threadIdx.x + threadIdx.y*m
Thread index within the block:  threadIdx.x + threadIdx.y*blockDim.x]
Large Two-dimensional Arrays
int main(void)
{
int i, j, n = 10000, m = 8000;
float *x, *y;
size_t sizea = n*m*sizeof(float);

cudaMallocHost( &x, sizea);


cudaMallocHost( &y, sizea);

for(i=0; i<n; i++) {


for(j=0; j<m; j++) {
x[j+i*m] = 1.0f;
y[j+i*m] = 2.0f;
}
}
int bdim_x = 32, bdim_y = 32;
int gdim_x=(m+bdim_x-1)/bdim_x, gdim_y=(n+bdim_y-1)/bdim_y;
dim3 blk(bdim_x,bdim_y), grd(gdim_x,gdim_y);
saxpy_2d_d<<<grd,blk>>>(n, m, 2.0f, x, y); cudaCheckError();
cudaDeviceSynchronize(); cudaCheckError();

float maxError = 0.0f;


for (int i = 0; i < n*m; i++) maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
printf( "n=%i, m=%i, grddim_x=%i, grddim_y=%i, blkdim_x=%i, blkdim_y=%i\n",
n,m,gdim_x,gdim_y,bdim_x,bdim_y );
cudaFreeHost(x); cudaFreeHost(y);
}
complete code in
Host code ~ohaan/cuda_kurs/saxpy_2d_large.cu
Adding Large Two-dimensional Arrays
__global__
void saxpy_2d_d( int n, int m, float a, float *x, float *y )
{
int i, j, index;
j = blockIdx.x*blockDim.x+threadIdx.x;
i = blockIdx.y*blockDim.y+threadIdx.y;
if( i<n && j<m ) {
index = i*m + j;
y[index] = a*x[index] + y[index] ;
}
}

device code

[Diagram: the n x m array (n rows, i = 0,...,n-1; m columns, j = 0,...,m-1) is covered
by a 2-dim grid of 2-dim blocks.
2-dim array index:  (i,j)
2-dim thread index: (blockIdx.x*blockDim.x+threadIdx.x, blockIdx.y*blockDim.y+threadIdx.y)]
Timing of CUDA Codes
• Read internal clock before and after a code segment in host code
• Since kernel calls from host are asynchronous, host and device must
be synchronized by cudaDeviceSynchronize() before
calling the internal clock

double tstart = int_clock();
...
kernel<<<grids,threads>>>(...);
cudaDeviceSynchronize();
double tend = int_clock();
printf( "elapsed time : %lf \n", tend-tstart );
Internal Clock for Elapsed Time
C-function gettimeofday returns elapsed time
with microsec precision.

#include <sys/time.h>
double get_el_time(){
  struct timeval et;
  gettimeofday( &et, NULL );
  return (double)et.tv_sec
        + 1.e-6*(double)et.tv_usec;
}

code for get_el_time in
~ohaan/cuda_kurs/time.ut
Timing of CUDA Code with CUDA Events
• Read internal clock before and after a code segment in host code

cudaEvent_t start, stop;


cudaEventCreate(&start);cudaEventCreate(&stop);
cudaEventRecord( start, 0 );
...
kernel<<<grids,threads>>>(...);
...
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
float et; cudaEventElapsedTime( &et,start, stop );
cudaEventDestroy( start );cudaEventDestroy( stop );
printf( "cpu time on device : %3.1f millisec \n", et );
Measuring Bandwidth for Adding Arrays
Array size: 10 000 x 10 000
Using unified memory with cudaMallocManaged

Measured bandwidth [GB/s] for different two-dim block sizes:

GPU type   Nominal bandwidth   (1024,1)   (32,32)   (1,1024)
gtx1080    320                 241        239        32
rtx5000    448                 380        378        65
v100       898                 754        728       135

code for bandwidth measurement in
~ohaan/cuda_kurs/saxpy_bw.cu
Memory Organization, Hardware View

[Diagram: the host consists of CPUs with caches attached to the host main memory; the
graphics device consists of several streaming multiprocessors (SMX 1 ... SMX n), an L2
cache and the device main memory. Each SMX contains a control unit for scheduling and
dispatching, a file of shared 32-bit registers, shared memory / L1 cache, and a
constant + texture cache.]
Device Memory, Software View

[Diagram: each thread has its own local memory; the threads of a block share the block's
shared memory; all blocks of a grid have access to the global, constant and texture
memory of the device.]
Types of Kernel Variables: Local
• Variables (scalars and arrays) defined in the scope of a kernel are local
__global__ void ker1(int szaloc2,..){
int iloc1, iloc2;
float aloc1[6], *aloc2;
aloc2 = (float *)malloc(sizeof(float)*szaloc2);
...
}
• Each thread has its own set of local variables, which are placed in the
register files of the SMXes, or in global memory, if there are not enough
registers or if the variable is an indexed array
• Number of 32-bit registers per SMX: 2^16 = 65536
• Maximal number of registers per thread: 255
• For the maximal number of 2048 threads per SMX, only 65536/2048 = 32 registers
  per thread are available
Types of Kernel Variables: Global
• global variables (scalars and arrays) defined in the scope of the
application, reside in device main memory and are shared by all threads
in the kernels
• If allocated from host by calling cudaMalloc()
they can be accessed from host by cudaMemcpy(...)
• If allocated from host by calling cudaMallocManaged()
they can be accessed from host and from device with the same address
(unified memory)
• accessing the same global variable from a kernel by different threads is
not deterministic, since the order of execution for different blocks of
threads is not prescribed
Accessing Global Variables
• Device memory is accessed by load/store operations for aligned
memory segments of size 32, 64, or 128 Bytes
• If the 32 threads of a warp access 32 int or float variables lying
consecutively in memory, 4 load/store operations of 32 Byte segments
serve all 32 accesses (coalescent access)
• Compare the performance of 2-dim array addition on gtx1080:
  blockDim.x = 32, blockDim.y = 32:    239 GB/s
  blockDim.x = 1,  blockDim.y = 1024:   32 GB/s
  Nominal bandwidth: 320 GB/s
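
The difference can be reproduced with two simple copy kernels. The following is a minimal
sketch (the kernel names copy_coalesced and copy_strided are illustrative, not taken from
the course codes):

// Coalescent access: consecutive threads of a warp load/store consecutive
// addresses, so each warp is served by a few 32-byte segment transactions.
__global__ void copy_coalesced( int n, const float *src, float *dst )
{
  int i = threadIdx.x + blockIdx.x*blockDim.x;
  if (i < n) dst[i] = src[i];
}

// Strided access: consecutive threads touch addresses 'stride' elements apart,
// so nearly every access of a warp falls into a different memory segment.
__global__ void copy_strided( int n, int stride, const float *src, float *dst )
{
  int i = threadIdx.x + blockIdx.x*blockDim.x;
  long long idx = (long long)i * stride;
  if (idx < n) dst[idx] = src[idx];
}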
Copy of Data in Global Memory

[Plot: measured bandwidth [GB/s] versus the number of float data copied, for different
block sizes. System: NVIDIA Quadro RTX 5000, nominal bandwidth 448 GB/s.]

Copy of Data in Global Memory

[Plot: measured bandwidth [GB/s] versus the number of float data copied, for different
block sizes. System: NVIDIA Tesla V100, nominal bandwidth 898 GB/s.]
Types of Kernel Variables: Constant
• constant memory variables (scalars and arrays), defined in the scope of the
application, are read only, reside in device main memory, are cached in the
constant cache of each SMX, and are shared by all threads in the kernels.
• Allocated on the device with the __device__ __constant__ qualifier
__device__ __constant__ int sconst;
__device__ __constant__ float aconst[1024];
• Can be initialized from host with
cudaMemcpyToSymbol(aconst, a_h, 1024*sizeof(float));
• If all threads in a kernel read the same data, the use of constant memory
variables reduces the accesses to device memory by employing the SMX's
8 kB sized constant caches.
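
A minimal sketch of how constant memory could be used (the names coef and poly_d are
illustrative, not from the course codes): every thread evaluates a polynomial with the
same coefficients, so the coefficient reads are served from the constant cache.

#include <stdio.h>

__device__ __constant__ float coef[4];         // read-only, cached per SMX

__global__ void poly_d( int n, float *x, float *y )
{
  int i = threadIdx.x + blockIdx.x*blockDim.x;
  if (i < n)                                   // all threads read the same coefficients
    y[i] = coef[0] + x[i]*(coef[1] + x[i]*(coef[2] + x[i]*coef[3]));
}

int main(void) {
  const int N = 1024;
  float coef_h[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float *x, *y;
  cudaMallocManaged( &x, N*sizeof(float) );
  cudaMallocManaged( &y, N*sizeof(float) );
  for (int i = 0; i < N; i++) x[i] = 1.0f;

  cudaMemcpyToSymbol( coef, coef_h, 4*sizeof(float) );   // initialize constant memory from host
  poly_d<<<1,N>>>( N, x, y );
  cudaDeviceSynchronize();
  printf("y[0] = %f\n", y[0]);                 // expected: 10.0
  cudaFree(x); cudaFree(y);
}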
Types of Kernel Variables: Shared
• shared variables (scalars and arrays) are defined in the scope of a block of threads
of a single kernel function and reside in the shared memory of the SMX executing
the block of threads
• All threads of a block have access to a block‘s shared variables
• Threads of other blocks cannot access a block‘s shared variables

• Static allocation of one or more shared arrays in a kernel function


__global__ void ker1(...){
__shared__ float sh_float[64]; __shared__ int sh_int[64];
...
}
• Dynamic allocation of shared memory (in one single shared array):
declaration outside kernel
extern __shared__ float sh_array[];
allocation in host code via extended execution configuration
size_t N_sh_bytes = 64*sizeof(float);
ker1<<<grid,block,N_sh_bytes>>>(…);
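
As an illustration of the dynamic variant, the following sketch (the kernel name reverse_d
is hypothetical) reverses the elements handled by one block through the dynamically
allocated shared array; the barrier __syncthreads(), introduced on the following slides,
is needed before reading what other threads have written:

extern __shared__ float sh_array[];            // size set by the third launch parameter

__global__ void reverse_d( int n, float *x )
{
  int i = threadIdx.x;
  if (i < n) sh_array[i] = x[i];
  __syncthreads();                             // all stores to shared memory completed
  if (i < n) x[i] = sh_array[n-1-i];
}

// host code: pass the shared-memory size in the execution configuration
//   int n = 256;
//   reverse_d<<<1, n, n*sizeof(float)>>>( n, x );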
Device Synchronization from Host
• Synchronous calls: cudaMalloc, cudaMemcpy, cudaMallocManaged

• Asynchronous calls: kernel<<<...>>>(...),
  cudaMemcpyAsync,
  cudaMemPrefetchAsync,
  ...

• A call in a host program to


cudaDeviceSynchronize();
will synchronize all previously started activities of the device
Thread Synchronization from Device
• In a device function, threads within a block can be synchronized by calling
the barrier
__syncthreads();
• Waits until all threads in a block have reached this instruction and all
accesses to global and shared memory from these threads are completed
• Danger of stalled execution:
if (i < cut) __syncthreads();
will hang if, within a block, some threads have i < cut and others have i >= cut
• Is used to coordinate memory access from threads within a single block
• __syncthreads() cannot coordinate the execution of threads from
different blocks
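
A typical use of __syncthreads() is a block-wise reduction in shared memory. The sketch
below (the kernel name block_sum_d is illustrative and assumes blockDim.x is a power of
two, at most 1024) produces one partial sum per block, which the host or a second kernel
can then combine:

__global__ void block_sum_d( int n, const float *b, float *partial )
{
  __shared__ float cache[1024];                // one element per thread of the block
  int tid = threadIdx.x;
  int i   = threadIdx.x + blockIdx.x*blockDim.x;

  cache[tid] = (i < n) ? b[i] : 0.0f;
  __syncthreads();                             // all loads into shared memory completed

  for (int s = blockDim.x/2; s > 0; s /= 2) {  // tree reduction within the block
    if (tid < s) cache[tid] += cache[tid + s];
    __syncthreads();                           // barrier reached by all threads of the block
  }
  if (tid == 0) partial[blockIdx.x] = cache[0];
}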
Atomic Operations
Example: accumulate the content of array b into memory
location a
• Sequential on host:   for (i=0; i<n; i++) a = a + b[i];
• Parallel in a kernel: if (i<n) a = a + b[i];

If several threads modify the content of the same address, the
result depends on the temporal order of their operations.
Atomic Operations
Possible interleaving 1: both threads read a before either one writes

Thread 0                           Thread 1
Read r1 from a    (r1 = 0)         Read r1 from a    (r1 = 0)
r2 = r1 + b[0]    (r2 = b[0])      r2 = r1 + b[1]    (r2 = b[1])
write r2 to a     (a = b[0])
                                   write r2 to a     (a = b[1])

Possible interleaving 2: Thread 1 reads a after Thread 0 has written

Thread 0                           Thread 1
Read r1 from a    (r1 = 0)
r2 = r1 + b[0]    (r2 = b[0])
write r2 to a     (a = b[0])
                                   Read r1 from a    (r1 = b[0])
                                   r2 = r1 + b[1]    (r2 = b[0]+b[1])
                                   write r2 to a     (a = b[0]+b[1])
Atomic Operations
atomicAdd : atomic for all CUDA threads in the current program
executing in the same compute device as the current thread.
atomicAdd_block : atomic for all CUDA threads in the current
program executing in the same thread block as the current thread
atomicAdd_system : atomic for all threads in the current program
including other CPUs and GPUs in the system

GPUs with compute capability < 6.0
  - only support the device-wide atomic functions
  - do not support atomicAdd for double precision numbers
    (a workaround based on atomicCAS is sketched at the end of this section)
Atomic Operations
An atomic function performs a read-modify-write atomic operation on
one 32-bit or 64-bit word residing in global or shared memory. The
operation is atomic in the sense that it is guaranteed to be performed
without interference from other threads
Atomic add:
int atomicAdd(int* value, int incr);
CUDA Fortran:
atomicadd( value, incr )
type of value/incr can be integer(4), integer(8), real(4), real(8)
Many more atomic operations are supported:
cf. CUDA Toolkit Programming Guide B12
CUDA FORTRAN PROGRAMMING GUIDE AND REFERENCE 3.6.6
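
As a minimal illustration (the kernel name accumulate_d is not from the course codes),
the accumulation from the slides above becomes deterministic when the update of a is
done with atomicAdd:

__global__ void accumulate_d( int n, const float *b, float *a )
{
  int i = threadIdx.x + blockIdx.x*blockDim.x;
  if (i < n) atomicAdd( a, b[i] );             // *a = *a + b[i] without interference
}

// after the launch and cudaDeviceSynchronize(), *a holds the sum of
// b[0],...,b[n-1], independent of the temporal order of the threads

For devices with compute capability < 6.0, which lack a double precision atomicAdd, the
CUDA Toolkit Programming Guide describes how it can be emulated with atomicCAS on the
64-bit word; a sketch of that workaround (the name atomicAdd_dbl is chosen here to avoid
a clash with the built-in atomicAdd on newer devices):

__device__ double atomicAdd_dbl( double *address, double val )
{
  unsigned long long int *address_as_ull = (unsigned long long int *)address;
  unsigned long long int old = *address_as_ull, assumed;
  do {
    assumed = old;                             // value expected to be found at *address
    old = atomicCAS( address_as_ull, assumed,
                     __double_as_longlong( val + __longlong_as_double(assumed) ) );
  } while (assumed != old);                    // retry if another thread modified *address
  return __longlong_as_double(old);
}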
