CUDA Lab Instruction
NVIDIA CUDA C Programming Guide, Version 4.0 (5/6/2011)
Chapter 1. Introduction
The reason behind the discrepancy in floating-point capability between the CPU and
the GPU is that the GPU is specialized for compute-intensive, highly parallel
computation – exactly what graphics rendering is about – and therefore designed
such that more transistors are devoted to data processing rather than data caching
and flow control, as schematically illustrated by Figure 1-2.
[Figure 1-2: CPU vs. GPU schematic contrasting how transistors are allocated to ALUs, cache, control logic, and DRAM.]
More specifically, the GPU is especially well-suited to address problems that can be
expressed as data-parallel computations – the same program is executed on many
data elements in parallel – with high arithmetic intensity – the ratio of arithmetic
operations to memory operations. Because the same program is executed for each
data element, there is a lower requirement for sophisticated flow control, and
because it is executed on many data elements and has high arithmetic intensity, the
memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many
applications that process large data sets can use a data-parallel programming model
to speed up the computations. In 3D rendering, large sets of pixels and vertices are
mapped to parallel threads. Similarly, image and media processing applications such
as post-processing of rendered images, video encoding and decoding, image scaling,
stereo vision, and pattern recognition can map image blocks and pixels to parallel
processing threads. In fact, many algorithms outside the field of image rendering
and processing are accelerated by data-parallel processing, from general signal
processing or physics simulation to computational finance or computational biology.
Figure 1-4. A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores.
Chapter 2. Programming Model
This chapter introduces the main concepts behind the CUDA programming model
by outlining how they are exposed in C. An extensive description of CUDA C is
given in Chapter 3.
Full code for the vector addition example used in this chapter and the next can be
found in the vectorAdd SDK code sample.
2.1 Kernels
CUDA C extends C by allowing the programmer to define C functions, called
kernels, that, when called, are executed N times in parallel by N different CUDA
threads, as opposed to only once like regular C functions.
A kernel is defined using the __global__ declaration specifier and the number of
CUDA threads that execute that kernel for a given kernel call is specified using a
new <<<…>>> execution configuration syntax (see Appendix B.16). Each thread that
executes the kernel is given a unique thread ID that is accessible within the kernel
through the built-in threadIdx variable.
As an illustration, the following sample code adds two vectors A and B of size N
and stores the result into vector C:
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
...
// Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C);
}
Here, each of the N threads that execute VecAdd() performs one pair-wise
addition.
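The invocation below assumes a two-dimensional, single-block MatAdd() kernel whose definition is not reproduced in this excerpt. A sketch of that kernel, following the standard CUDA guide example (each thread handles one matrix element, indexed by threadIdx.x and threadIdx.y):
// Kernel definition (sketch)
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}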
int main()
{
...
// Kernel invocation with one block of N * N * 1 threads
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
There is a limit to the number of threads per block, since all threads of a block are
expected to reside on the same processor core and must share the limited memory
resources of that core. On current GPUs, a thread block may contain up to 1024
threads.
However, a kernel can be executed by multiple equally-shaped thread blocks, so that
the total number of threads is equal to the number of threads per block times the
number of blocks.
Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional
grid of thread blocks as illustrated by Figure 2-1. The number of thread blocks in a
grid is usually dictated by the size of the data being processed or the number of
processors in the system, which it can greatly exceed.
[Figure 2-1. Grid of Thread Blocks]
The number of threads per block and the number of blocks per grid specified in the
<<<…>>> syntax can be of type int or dim3. Two-dimensional blocks or grids can
be specified as in the example above.
Each block within the grid can be identified by a one-dimensional, two-dimensional,
or three-dimensional index accessible within the kernel through the built-in
blockIdx variable. The dimension of the thread block is accessible within the
kernel through the built-in blockDim variable.
Extending the previous MatAdd() example to handle multiple blocks, the code
becomes as follows.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i < N && j < N)
C[i][j] = A[i][j] + B[i][j];
}

int main()
{
...
// Kernel invocation
dim3 threadsPerBlock(16, 16);
dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
A thread block size of 16x16 (256 threads), although arbitrary in this case, is a
common choice. The grid is created with enough blocks to have one thread per
matrix element as before. For simplicity, this example assumes that the number of
threads per grid in each dimension is evenly divisible by the number of threads per
block in that dimension, although that need not be the case.
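If N is not a multiple of the block dimension, the grid can simply be rounded up and the out-of-range threads filtered by the bounds check already present in the kernel. A sketch of that adjusted launch (host code, not part of the original example):
// Host code (sketch): round the grid size up so every matrix element is covered
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);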
Thread blocks are required to execute independently: It must be possible to execute
them in any order, in parallel or in series. This independence requirement allows
thread blocks to be scheduled in any order across any number of cores as illustrated
by Figure 1-4, enabling programmers to write code that scales with the number of
cores.
Threads within a block can cooperate by sharing data through some shared memory
and by synchronizing their execution to coordinate memory accesses. More
precisely, one can specify synchronization points in the kernel by calling the
__syncthreads() intrinsic function; __syncthreads() acts as a barrier at
which all threads in the block must wait before any is allowed to proceed.
Section 3.2.3 gives an example of using shared memory.
For efficient cooperation, the shared memory is expected to be a low-latency
memory near each processor core (much like an L1 cache) and __syncthreads()
is expected to be lightweight.
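As a brief illustration (a sketch, not taken from the guide), the kernel below stages one block's worth of data in shared memory and uses __syncthreads() so that every thread can safely read an element written by another thread of the same block:
__global__ void reverseBlock(float* out, const float* in)
{
    __shared__ float tile[256];                 // assumes the block is launched with 256 threads
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = in[base + t];                     // each thread writes one element
    __syncthreads();                            // barrier: all writes to tile are now visible
    out[base + t] = tile[blockDim.x - 1 - t];   // read an element written by another thread
}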
[Figure 2-2. Memory Hierarchy: each thread has per-thread local memory, each thread block has per-block shared memory, and all threads across grids access the same global memory.]
The CUDA programming model also assumes that both the host and the device
maintain their own separate memory spaces in DRAM, referred to as host memory and
device memory, respectively. Therefore, a program manages the global, constant, and
texture memory spaces visible to kernels through calls to the CUDA runtime
(described in Chapter 3). This includes device memory allocation and deallocation as
well as data transfer between host and device memory.
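As a minimal host-side sketch of this pattern (mirroring the vectorAdd sample referenced earlier; h_A, h_B, and h_C are assumed host arrays of N floats):
// Allocate device memory, copy inputs, launch the kernel, copy back, free
float *d_A, *d_B, *d_C;
size_t size = N * sizeof(float);
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
VecAdd<<<1, N>>>(d_A, d_B, d_C);              // kernel from Section 2.1
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);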
[Figure 2-3. Heterogeneous Programming: serial code executes on the host while parallel code executes on the device.]
B.1.1 __device__
The __device__ qualifier declares a function that is:
Executed on the device
Callable from the device only.
B.1.2 __global__
The __global__ qualifier declares a function as being a kernel. Such a function is:
Executed on the device,
Callable from the host only.
__global__ functions must have void return type.
Any call to a __global__ function must specify its execution configuration as
described in Section B.16.
A call to a __global__ function is asynchronous, meaning it returns before the
device has completed its execution.
B.1.3 __host__
The __host__ qualifier declares a function that is:
Executed on the host,
Callable from the host only.
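A short illustrative sketch (not from the guide) combining these qualifiers; the function names are arbitrary:
__device__ float square(float x)            // device-only helper, callable from device code
{
    return x * x;
}

__global__ void squareAll(float* data)      // kernel: runs on the device, launched from the host
{
    data[threadIdx.x] = square(data[threadIdx.x]);
}

__host__ void launchSquareAll(float* d_data, int n)   // ordinary host function
{
    squareAll<<<1, n>>>(d_data);
}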
B.2.1 __device__
The __device__ qualifier declares a variable that resides on the device.
At most one of the other type qualifiers defined in the next three sections may be
used together with __device__ to further specify which memory space the
variable belongs to. If none of them is present, the variable:
Resides in global memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host through the
runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() /
cudaMemcpyToSymbol() / cudaMemcpyFromSymbol() for the runtime
API and cuModuleGetGlobal() for the driver API).
B.2.2 __constant__
The __constant__ qualifier, optionally used together with __device__,
declares a variable that:
Resides in constant memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host through the
runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() /
cudaMemcpyToSymbol() / cudaMemcpyFromSymbol() for the runtime
API and cuModuleGetGlobal() for the driver API).
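For example (a sketch; the variable and function names are illustrative), a __device__ and a __constant__ variable set from the host through the runtime symbol API:
__device__   float deviceScale;             // resides in global memory
__constant__ float constCoeffs[16];         // resides in constant memory

void setupSymbols(const float* h_coeffs)    // host code; h_coeffs is an assumed host array of 16 floats
{
    float scale = 2.0f;
    cudaMemcpyToSymbol(deviceScale, &scale, sizeof(float));
    cudaMemcpyToSymbol(constCoeffs, h_coeffs, 16 * sizeof(float));
}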
B.2.3 __shared__
The __shared__ qualifier, optionally used together with __device__, declares a
variable that:
Resides in the shared memory space of a thread block,
Has the lifetime of the block,
Is only accessible from all the threads within the block.
When declaring a variable in shared memory as an external array such as
extern __shared__ float shared[];
the size of the array is determined at launch time (see Section B.16). All variables
declared in this fashion start at the same address in memory, so that the layout of
the variables in the array must be explicitly managed through offsets. For example, if
one wants the equivalent of
short array0[128];
float array1[64];
int array2[256];
in dynamically allocated shared memory, one could declare and initialize the arrays
the following way:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
Note that pointers need to be aligned to the type they point to, so the following
code, for example, does not work since array1 is not aligned to 4 bytes.
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[127];
}
Alignment requirements for the built-in vector types are listed in Table B-1.
B.2.4 __restrict__
nvcc supports restricted pointers via the __restrict__ keyword.
Restricted pointers were introduced in C99 to alleviate the aliasing problem that
exists in C-type languages, and which inhibits all kinds of optimization, from code
reordering to common sub-expression elimination.
Here is an example subject to the aliasing issue, where use of restricted pointers can
help the compiler to reduce the number of instructions:
void foo(const float* a,
const float* b,
float* c)
{
c[0] = a[0] * b[0];
c[1] = a[0] * b[0];
c[2] = a[0] * b[0] * a[1];
c[3] = a[0] * a[1];
c[4] = a[0] * b[0];
c[5] = b[0];
...
}
In C-type languages, the pointers a, b, and c may be aliased, so any write through c
could modify elements of a or b. This means that to guarantee functional
correctness, the compiler cannot load a[0] and b[0] into registers, multiply them,
and store the result to both c[0] and c[1], because the results would differ from
the abstract execution model if, say, a[0] is really the same location as c[0]. So
the compiler cannot take advantage of the common sub-expression. Likewise,
the compiler cannot just reorder the computation of c[4] into the proximity of the
computation of c[0] and c[1] because the preceding write to c[3] could change
the inputs to the computation of c[4].
By making a, b, and c restricted pointers, the programmer asserts to the compiler
that the pointers are in fact not aliased, which in this case means writes through c
would never overwrite elements of a or b. This changes the function prototype as
follows:
void foo(const float* __restrict__ a,
const float* __restrict__ b,
float* __restrict__ c);
Note that all pointer arguments need to be made restricted for the compiler
optimizer to derive any benefit. With the __restrict__ keyword added, the
compiler can now reorder and do common sub-expression elimination at will, while
retaining functionality identical with the abstract execution model:
void foo(const float* __restrict__ a,
const float* __restrict__ b,
float* __restrict__ c)
{
float t0 = a[0];
float t1 = b[0];
float t2 = t0 * t1;
float t3 = a[1];
c[0] = t2;
c[1] = t2;
c[4] = t2;
c[2] = t2 * t3;
c[3] = t0 * t3;
c[5] = t1;
...
}
The effects here are a reduced number of memory accesses and reduced number of
computations. This is balanced by an increase in register pressure due to "cached"
loads and common sub-expressions.
Since register pressure is a critical issue in many CUDA codes, use of restricted
pointers can have negative performance impact on CUDA code, due to reduced
occupancy.
B.3.2 dim3
This type is an integer vector type based on uint3 that is used to specify
dimensions. When defining a variable of type dim3, any component left unspecified
is initialized to 1.
B.4.1 gridDim
This variable is of type dim3 (see Section B.3.2) and contains the dimensions of the
grid.
B.4.2 blockIdx
This variable is of type uint3 (see Section B.3.1) and contains the block index
within the grid.
B.4.3 blockDim
This variable is of type dim3 (see Section B.3.2) and contains the dimensions of the
block.
B.4.4 threadIdx
This variable is of type uint3 (see Section B.3.1) and contains the thread index
within the block.
B.4.5 warpSize
This variable is of type int and contains the warp size in threads (see Section 4.1
for the definition of a warp).
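A small sketch (not from the guide) that uses these built-in variables to compute a per-thread global index and warp lane for a one-dimensional launch; the kernel name and parameters are illustrative:
__global__ void writeThreadInfo(int* globalIndex, int* warpLane)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index across the whole grid
    globalIndex[i] = i;
    warpLane[i]    = threadIdx.x % warpSize;         // position of the thread within its warp
    // gridDim.x * blockDim.x gives the total number of threads in this 1D launch
}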
B.8.1 tex1Dfetch()
template<class DataType>
Type tex1Dfetch(
texture<DataType, cudaTextureType1D,
cudaReadModeElementType> texRef,
int x);
float tex1Dfetch(
texture<unsigned char, cudaTextureType1D,
cudaReadModeNormalizedFloat> texRef,
int x);
float tex1Dfetch(
texture<signed char, cudaTextureType1D,
cudaReadModeNormalizedFloat> texRef,
int x);
float tex1Dfetch(
texture<unsigned short, cudaTextureType1D,
cudaReadModeNormalizedFloat> texRef,
int x);
float tex1Dfetch(
texture<signed short, cudaTextureType1D,
cudaReadModeNormalizedFloat> texRef,
int x);
fetch the region of linear memory bound to texture reference texRef using integer
texture coordinate x. No texture filtering and addressing modes are supported. For
integer types, these functions may optionally promote the integer to single-precision
floating point.
Besides the functions shown above, 2-, and 4-tuples are supported; for example:
float4 tex1Dfetch(
texture<uchar4, cudaTextureType1D,
cudaReadModeNormalizedFloat> texRef,
int x);
fetches the region of linear memory bound to texture reference texRef using
texture coordinate x.
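As a usage sketch (an illustration, not from the guide): a texture reference bound to linear device memory on the host and read with tex1Dfetch() inside a kernel.
texture<float, cudaTextureType1D, cudaReadModeElementType> texRef;

__global__ void copyFromTexture(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);   // integer coordinate; no filtering or addressing modes
}

// Host side (sketch): cudaBindTexture(0, texRef, d_in, n * sizeof(float));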
B.8.2 tex1D()
template<class DataType, enum cudaTextureReadMode readMode>
Type tex1D(texture<DataType, cudaTextureType1D, readMode> texRef,
float x);
fetches the CUDA array bound to texture reference texRef using texture
coordinate x.
B.8.3 tex2D()
template<class DataType, enum cudaTextureReadMode readMode>
Type tex2D(texture<DataType, cudaTextureType2D, readMode> texRef,
float x, float y);
fetches the CUDA array or the region of linear memory bound to texture reference
texRef using texture coordinates x and y.
B.8.4 tex3D()
template<class DataType, enum cudaTextureReadMode readMode>
Type tex3D(texture<DataType, cudaTextureType3D, readMode> texRef,
float x, float y, float z);
fetches the CUDA array bound to texture reference texRef using texture
coordinates x, y, and z.
B.8.5 tex1DLayered()
template<class DataType, enum cudaTextureReadMode readMode>
Type tex1DLayered(
texture<DataType, cudaTextureType1DLayered, readMode> texRef,
float x, int layer);
fetches the CUDA array bound to texture reference texRef using texture
coordinate x.
B.8.6 tex2DLayered()
template<class DataType, enum cudaTextureReadMode readMode>
Type tex2DLayered(
texture<DataType, cudaTextureType2DLayered, readMode> texRef,
float x, float y, int layer);
fetches the CUDA array bound to texture reference texRef using texture
coordinates x and y.
B.9.1 surf1Dread()
template<class Type>
Type surf1Dread(surface<void, 1> surfRef, int x,
boundaryMode = cudaBoundaryModeTrap);
reads the CUDA array bound to surface reference surfRef using coordinate x.
B.9.2 surf1Dwrite()
template<class Type>
void surf1Dwrite(Type data, surface<void, 1> surfRef, int x,
boundaryMode = cudaBoundaryModeTrap);
writes value data to the CUDA array bound to surface reference surfRef at
coordinate x.
B.9.3 surf2Dread()
template<class Type>
Type surf2Dread(surface<void, 2> surfRef,
int x, int y,
boundaryMode = cudaBoundaryModeTrap);
reads the CUDA array bound to surface reference surfRef using coordinates x
and y.
B.9.4 surf2Dwrite()
template<class Type>
void surf2Dwrite(Type data, surface<void, 2> surfRef,
int x, int y,
boundaryMode = cudaBoundaryModeTrap);
writes value data to the CUDA array bound to surface reference surfRef at
coordinates x and y.
B.11 Atomic Functions
For example, atomicAdd() for double-precision floating-point numbers, which has no
native atomic on these devices, can be implemented in terms of atomicCAS():
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                            (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
B.11.1.2 atomicSub()
int atomicSub(int* address, int val);
unsigned int atomicSub(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes (old - val), and stores the result back to memory at the
same address. These three operations are performed in one atomic transaction. The
function returns old.
B.11.1.3 atomicExch()
int atomicExch(int* address, int val);
unsigned int atomicExch(unsigned int* address,
unsigned int val);
unsigned long long int atomicExch(unsigned long long int* address,
unsigned long long int val);
float atomicExch(float* address, float val);
reads the 32-bit or 64-bit word old located at the address address in global or
shared memory and stores val back to memory at the same address. These two
operations are performed in one atomic transaction. The function returns old.
B.11.1.4 atomicMin()
int atomicMin(int* address, int val);
unsigned int atomicMin(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes the minimum of old and val, and stores the result back to
memory at the same address. These three operations are performed in one atomic
transaction. The function returns old.
B.11.1.5 atomicMax()
int atomicMax(int* address, int val);
unsigned int atomicMax(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes the maximum of old and val, and stores the result back to
memory at the same address. These three operations are performed in one atomic
transaction. The function returns old.
B.11.1.6 atomicInc()
unsigned int atomicInc(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes ((old >= val) ? 0 : (old+1)), and stores the result
back to memory at the same address. These three operations are performed in one
atomic transaction. The function returns old.
B.11.1.7 atomicDec()
unsigned int atomicDec(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes (((old == 0) | (old > val)) ? val : (old-1)),
and stores the result back to memory at the same address. These three operations
are performed in one atomic transaction. The function returns old.
B.11.1.8 atomicCAS()
int atomicCAS(int* address, int compare, int val);
unsigned int atomicCAS(unsigned int* address,
unsigned int compare,
unsigned int val);
unsigned long long int atomicCAS(unsigned long long int* address,
unsigned long long int compare,
unsigned long long int val);
reads the 32-bit or 64-bit word old located at the address address in global or
shared memory, computes (old == compare ? val : old), and stores the
result back to memory at the same address. These three operations are performed in
one atomic transaction. The function returns old (Compare And Swap).
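As an illustration of the typical atomicCAS() pattern (a sketch, not from the guide), an atomic maximum for float built from the integer compare-and-swap:
__device__ float atomicMaxFloat(float* address, float val)
{
    int* address_as_int = (int*)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(fmaxf(val, __int_as_float(assumed))));
    } while (assumed != old);
    return __int_as_float(old);
}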
B.11.2.2 atomicOr()
int atomicOr(int* address, int val);
unsigned int atomicOr(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes (old | val), and stores the result back to memory at the
same address. These three operations are performed in one atomic transaction. The
function returns old.
B.11.2.3 atomicXor()
int atomicXor(int* address, int val);
unsigned int atomicXor(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes (old ^ val), and stores the result back to memory at the
same address. These three operations are performed in one atomic transaction. The
function returns old.
B.13 Profiler Counter Function
The per-multiprocessor hardware counters incremented by the __prof_trigger()
function can be obtained via the CUDA profiler (refer to the profiler manual for
more details). All counters are reset before each kernel call (note that when an
application is run via a CUDA debugger or profiler (cuda-gdb, CUDA Visual
Profiler, Parallel Nsight), all launches are synchronous).
B.14.2 Limitations
Final formatting of the printf() output takes place on the host system. This
means that the format string must be understood by the host-system's compiler and
C library. Every effort has been made to ensure that the format specifiers supported
by CUDA's printf function form a universal subset from the most common host
compilers, but exact behavior will be host-O/S-dependent.
As described in Section B.14.1, printf() will accept all combinations of valid flags
and types. This is because it cannot determine what will and will not be valid on the
host system where the final output is formatted. The effect of this is that output
may be undefined if the program emits a format string which contains invalid
combinations.
The printf() command can accept at most 32 arguments in addition to the
format string. Additional arguments beyond this will be ignored, and the format
specifier output as-is.
Owing to the differing size of the long type on 64-bit Windows platforms (four
bytes on 64-bit Windows platforms, eight bytes on other 64-bit platforms), a kernel
which is compiled on a non-Windows 64-bit machine but then run on a win64
machine will see corrupted output for all format strings which include “%ld”. It is
recommended that the compilation platform matches the execution platform to
ensure safety.
The output buffer for printf() is set to a fixed size before kernel launch (see
Section B.14.3). It is circular and if more output is produced during kernel execution
than can fit in the buffer, older output is overwritten. It is flushed only when one of
these actions is performed:
Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch,
and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at
the end of the launch as well),
Synchronization via cudaDeviceSynchronize(),
cuCtxSynchronize(), cudaStreamSynchronize(),
cuStreamSynchronize(), cudaEventSynchronize(), or
cuEventSynchronize(),
Memory copies via any blocking version of cudaMemcpy*() or
cuMemcpy*(),
Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
Context destruction via cudaDeviceReset() or cuCtxDestroy().
Note that the buffer is not flushed automatically when the program exits. The user
must call cudaDeviceReset() or cuCtxDestroy() explicitly, as shown in the
examples below.
B.14.4 Examples
The following code sample:
// printf() is only supported
// for devices of compute capability 2.0 and above
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 200)
#define printf(f, ...) ((void)(f, __VA_ARGS__),0)
#endif

__global__ void helloCUDA(float f)
{
    printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}

void main()
{
helloCUDA<<<1, 5>>>(1.2345f);
cudaDeviceReset();
}
will output:
Hello thread 2, f=1.2345
Hello thread 1, f=1.2345
Hello thread 4, f=1.2345
Hello thread 0, f=1.2345
Hello thread 3, f=1.2345
Notice how each thread encounters the printf() command, so there are as many
lines of output as there were threads launched in the grid. As expected, global values
(i.e. float f) are common between all threads, and local values
(i.e. threadIdx.x) are distinct per-thread.
The following code sample:
__global__ void helloCUDA(float f)
{
if (threadIdx.x == 0)
printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}
void main()
{
helloCUDA<<<1, 5>>>(1.2345f);
cudaDeviceReset();
}
will output:
Hello thread 0, f=1.2345
Self-evidently, the if() statement limits which threads will call printf, so that
only a single line of output is seen.
Heap size cannot be changed once a module load has occurred and it does not
resize dynamically according to need.
Memory reserved for the device heap is in addition to memory allocated through
host-side CUDA API calls such as cudaMalloc().
B.15.3 Examples
B.15.3.1 Per Thread Allocation
The following code sample:
#include <stdlib.h>
#include <stdio.h>

__global__ void mallocTest()
{
    char* ptr = (char*)malloc(123);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
}

void main()
{
// Set a heap size of 128 megabytes. Note that this must
// be done before any kernel is launched.
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
mallocTest<<<1, 5>>>();
cudaDeviceSynchronize();
}
will output:
Thread 0 got pointer: 00057020
Thread 1 got pointer: 0005708c
Thread 2 got pointer: 000570f8
Thread 3 got pointer: 00057164
Thread 4 got pointer: 000571d0
Notice how each thread encounters the malloc() command and so receives its
own allocation. (Exact pointer values will vary: these are illustrative.)
B.15.3.2 Per Thread Block Allocation
void main()
{
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
mallocTest<<<10, 128>>>();
cudaDeviceSynchronize();
}
B.15.3.3 Allocation Persisting Between Kernel Launches
#define NUM_BLOCKS 20
if (dataptr[blockIdx.x] == NULL)
return;
void main()
{
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
// Allocate memory
allocmem<<< NUM_BLOCKS, 10 >>>();
// Use memory
usemem<<< NUM_BLOCKS, 10 >>>();
usemem<<< NUM_BLOCKS, 10 >>>();
usemem<<< NUM_BLOCKS, 10 >>>();
// Free memory
freemem<<< NUM_BLOCKS, 10 >>>();
cudaDeviceSynchronize();
}
B.16 Execution Configuration
When using the runtime API (Section 3.2), the execution configuration is specified
by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the
function name and the parenthesized argument list, where:
Dg is of type dim3 (see Section B.3.2) and specifies the dimension and size of
the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being
launched; Dg.z must be equal to 1 for devices of compute capability 1.x;
Db is of type dim3 (see Section B.3.2) and specifies the dimension and size of
each block, such that Db.x * Db.y * Db.z equals the number of threads
per block;
Ns is of type size_t and specifies the number of bytes in shared memory that
is dynamically allocated per block for this call in addition to the statically
allocated memory; this dynamically allocated memory is used by any of the
variables declared as an external array as mentioned in Section B.2.3; Ns is an
optional argument which defaults to 0;
S is of type cudaStream_t and specifies the associated stream; S is an
optional argument which defaults to 0.
As an example, a function declared as
__global__ void Func(float* parameter);
must be called like this:
Func<<< Dg, Db, Ns >>>(parameter);
The arguments to the execution configuration are evaluated before the actual
function arguments and like the function arguments, are currently passed via shared
memory to the device.
The function call will fail if Dg or Db are greater than the maximum sizes allowed
for the device as specified in Appendix F, or if Ns is greater than the maximum
amount of shared memory available on the device, minus the amount of shared
memory required for static allocation, function arguments (for devices of compute
capability 1.x), and execution configuration.
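A short sketch (not from the guide) exercising all four execution-configuration arguments, including dynamically allocated shared memory (Ns) and a stream (S); the kernel and helper names are illustrative:
extern __shared__ float buffer[];              // sized by Ns at launch time (Section B.2.3)

__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buffer[threadIdx.x] = data[i] * factor;
        data[i] = buffer[threadIdx.x];
    }
}

void launchScale(float* d_data, int n, cudaStream_t stream)   // host code
{
    dim3 Db(256);                                // threads per block
    dim3 Dg((n + Db.x - 1) / Db.x);              // blocks per grid, rounded up
    size_t Ns = Db.x * sizeof(float);            // bytes of dynamic shared memory per block
    scale<<<Dg, Db, Ns, stream>>>(d_data, n, 2.0f);
}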
B.17 Launch Bounds
// Device code
__global__ void
__launch_bounds__(MY_KERNEL_MAX_THREADS, MY_KERNEL_MIN_BLOCKS)
MyKernel(...)
{
...
}
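The discussion below assumes macro definitions along the lines of the guide's corresponding example (an assumption for this excerpt), with THREADS_PER_BLOCK equal to 256 and the launch bounds doubled for devices of compute capability 2.x:
#define THREADS_PER_BLOCK          256
#if __CUDA_ARCH__ >= 200
    #define MY_KERNEL_MAX_THREADS  (2 * THREADS_PER_BLOCK)
    #define MY_KERNEL_MIN_BLOCKS   3
#else
    #define MY_KERNEL_MAX_THREADS  THREADS_PER_BLOCK
    #define MY_KERNEL_MIN_BLOCKS   2
#endif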
In the common case where MyKernel is invoked with the maximum number of
threads per block (specified as the first parameter of __launch_bounds__()), it
is tempting to use MY_KERNEL_MAX_THREADS as the number of threads per block
in the execution configuration:
// Host code
MyKernel<<<blocksPerGrid, MY_KERNEL_MAX_THREADS>>>(...);
This will not work, however, since __CUDA_ARCH__ is undefined in host code as
mentioned in Section 3.1.4, so MyKernel will launch with 256 threads per block
even when __CUDA_ARCH__ is greater than or equal to 200. Instead the number of
threads per block should be determined:
Either at compile time using a macro that does not depend on
__CUDA_ARCH__, for example
// Host code
MyKernel<<<blocksPerGrid, THREADS_PER_BLOCK>>>(...);
Or at runtime based on the compute capability
// Host code
cudaGetDeviceProperties(&deviceProp, device);
int threadsPerBlock =
(deviceProp.major >= 2 ?
2 * THREADS_PER_BLOCK : THREADS_PER_BLOCK);
MyKernel<<<blocksPerGrid, threadsPerBlock>>>(...);
Register usage is reported by the --ptxas-options=-v compiler option. The
number of resident blocks can be derived from the occupancy reported by the
CUDA profiler (see Section 5.2.3 for a definition of occupancy).
Register usage can also be controlled for all __global__ functions in a file using
the -maxrregcount compiler option. The value of -maxrregcount is ignored
for functions with launch bounds.