NVIDIA CUDA C
Programming Guide
Version 3.2
11/9/2010
Changes from Version 3.1.1
Simplified all the code samples that use cuParamSetv() to set a kernel
parameter of type CUdeviceptr since CUdeviceptr is now of the same size and
alignment as void*, so there is no longer any need to go through an
intermediate void* variable.
Added Section 3.2.4.1.4 on 16-bit floating-point textures.
Added Section 3.2.4.4 on read/write coherency for texture and surface memory.
Added more details about surface memory access to Section 3.2.4.2.
Added more details to Section 3.2.6.5.
Mentioned new stream synchronization function cudaStreamSynchronize()
in Section 3.2.6.5.2.
Mentioned in Sections 3.2.7.2, 3.3.10.2, and 4.3 the new API calls to deal with
devices using NVIDIA SLI in AFR mode.
Added Sections 3.2.9 and 3.3.12 about the call stack.
Changed the type of the pitch variable in the second code sample of
Section 3.3.4 from unsigned int to size_t following the function
signature change of cuMemAllocPitch().
Changed the type of the bytes variable in the last code sample of Section 3.3.4
from unsigned int to size_t following the function signature change of
cuModuleGetGlobal().
Removed cuParamSetTexRef() from Section 3.3.7 as it is no longer
necessary.
Updated Section 5.2.3, Table 5-1, and Section G.4.1 for devices of compute
capability 2.1.
Added GeForce GTX 480M, GeForce GTX 470M, GeForce GTX 460M,
GeForce GTX 445M, GeForce GTX 435M, GeForce GTX 425M,
GeForce GTX 420M, GeForce GTX 415M, GeForce GTX 460,
GeForce GTS 450, GeForce GTX 465, GeForce GTX 580, Quadro 2000,
Quadro 600, Quadro 4000, Quadro 5000, Quadro 5000M, and Quadro 6000 to
Table A-1.
Fixed sample code in Section B.2.3: array[] was declared as an array of char
causing a compiler error (“Unaligned memory accesses not supported”) when
casting array to a pointer of higher alignment requirement; declaring
array[] as an array of float fixes it.
Mentioned in Section B.11 that any atomic operation can be implemented based
on atomic Compare And Swap.
Added Section B.15 on the new malloc() and free() device functions.
Moved the type casting functions to a separate section C.2.4.
Fixed the maximum height of a 2D texture reference for devices of compute
capability 2.x (65535 instead of 65536) in Section G.1.
Figure 1-1. Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU
Figure 1-2. The GPU Devotes More Transistors to Data Processing
Figure 1-3. CUDA is Designed to Support Various Languages or Application Programming Interfaces
Figure 1-4. Automatic Scalability
Figure 2-1. Grid of Thread Blocks
Figure 2-2. Memory Hierarchy
Figure 2-3. Heterogeneous Programming
Figure 3-1. Matrix Multiplication without Shared Memory
Figure 3-2. Matrix Multiplication with Shared Memory
Figure 3-3. Library Context Management
Figure 3-4. The Driver API is Backward, but Not Forward Compatible
The reason behind the discrepancy in floating-point capability between the CPU and
the GPU is that the GPU is specialized for compute-intensive, highly parallel
computation – exactly what graphics rendering is about – and therefore designed
such that more transistors are devoted to data processing rather than data caching
and flow control, as schematically illustrated by Figure 1-2.
[Figure 1-2. The GPU Devotes More Transistors to Data Processing: compared to a CPU, the GPU dedicates more of its die area to ALUs and less to cache and flow control.]
More specifically, the GPU is especially well-suited to address problems that can be
expressed as data-parallel computations – the same program is executed on many
data elements in parallel – with high arithmetic intensity – the ratio of arithmetic
operations to memory operations. Because the same program is executed for each
data element, there is a lower requirement for sophisticated flow control, and
because it is executed on many data elements and has high arithmetic intensity, the
memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many
applications that process large data sets can use a data-parallel programming model
to speed up the computations. In 3D rendering, large sets of pixels and vertices are
mapped to parallel threads. Similarly, image and media processing applications such
as post-processing of rendered images, video encoding and decoding, image scaling,
stereo vision, and pattern recognition can map image blocks and pixels to parallel
processing threads. In fact, many algorithms outside the field of image rendering
and processing are accelerated by data-parallel processing, from general signal
processing or physics simulation to computational finance or computational biology.
This decomposition preserves language expressivity by allowing threads to
cooperate when solving each sub-problem, and at the same time enables automatic
scalability. Indeed, each block of threads can be scheduled on any of the available
processor cores, in any order, concurrently or sequentially, so that a compiled
CUDA program can execute on any number of processor cores as illustrated by
Figure 1-4, and only the runtime system needs to know the physical processor
count.
This scalable programming model allows the CUDA architecture to span a wide
market range by simply scaling the number of processors and memory partitions:
from the high-performance enthusiast GeForce GPUs and professional Quadro and
Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs
(see Appendix A for a list of all CUDA-enabled GPUs).
[Figure 1-4. Automatic Scalability]
A multithreaded program is partitioned into blocks of threads that execute independently from each
other, so that a GPU with more cores will automatically execute the program in less time than a GPU
with fewer cores.
This chapter introduces the main concepts behind the CUDA programming model
by outlining how they are exposed in C. An extensive description of CUDA C is
given in Section 3.2.
Full code for the vector addition example used in this chapter and the next can be
found in the vectorAdd SDK code sample.
2.1 Kernels
CUDA C extends C by allowing the programmer to define C functions, called
kernels, that, when called, are executed N times in parallel by N different CUDA
threads, as opposed to only once like regular C functions.
A kernel is defined using the __global__ declaration specifier and the number of
CUDA threads that execute that kernel for a given kernel call is specified using a
new <<<…>>> execution configuration syntax (see Appendix B.16). Each thread that
executes the kernel is given a unique thread ID that is accessible within the kernel
through the built-in threadIdx variable.
As an illustration, the following sample code adds two vectors A and B of size N
and stores the result into vector C:
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
...
// Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C);
}
Here, each of the N threads that execute VecAdd() performs one pair-wise
addition.
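The main() that follows launches a MatAdd() kernel whose definition is not shown in this excerpt; a minimal sketch of a single-block matrix-addition kernel, using the two-dimensional threadIdx described above, might look like this (N is assumed to be a compile-time constant):

    // Kernel definition (sketch): one thread per matrix element,
    // indexed by a two-dimensional thread ID
    __global__ void MatAdd(float A[N][N], float B[N][N],
                           float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }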
int main()
{
...
// Kernel invocation with one block of N * N * 1 threads
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
There is a limit to the number of threads per block, since all threads of a block are
expected to reside on the same processor core and must share the limited memory
resources of that core. On current GPUs, a thread block may contain up to 1024
threads.
However, a kernel can be executed by multiple equally-shaped thread blocks, so that
the total number of threads is equal to the number of threads per block times the
number of blocks.
Blocks are organized into a one-dimensional or two-dimensional grid of thread
blocks as illustrated by Figure 2-1. The number of thread blocks in a grid is usually
dictated by the size of the data being processed or the number of processors in the
system, which it can greatly exceed.
[Figure 2-1. Grid of Thread Blocks]
The number of threads per block and the number of blocks per grid specified in the
<<<…>>> syntax can be of type int or dim3. Two-dimensional blocks or grids can
be specified as in the example above.
Each block within the grid can be identified by a one-dimensional or two-
dimensional index accessible within the kernel through the built-in blockIdx
variable. The dimension of the thread block is accessible within the kernel through
the built-in blockDim variable.
Extending the previous MatAdd() example to handle multiple blocks, the code
becomes as follows.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i < N && j < N)
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
// Kernel invocation
dim3 threadsPerBlock(16, 16);
dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
A thread block size of 16x16 (256 threads), although arbitrary in this case, is a
common choice. The grid is created with enough blocks to have one thread per
matrix element as before. For simplicity, this example assumes that the number of
threads per grid in each dimension is evenly divisible by the number of threads per
block in that dimension, although that need not be the case.
Thread blocks are required to execute independently: It must be possible to execute
them in any order, in parallel or in series. This independence requirement allows
thread blocks to be scheduled in any order across any number of cores as illustrated
by Figure 1-4, enabling programmers to write code that scales with the number of
cores.
Threads within a block can cooperate by sharing data through some shared memory
and by synchronizing their execution to coordinate memory accesses. More
precisely, one can specify synchronization points in the kernel by calling the
__syncthreads() intrinsic function; __syncthreads() acts as a barrier at
which all threads in the block must wait before any is allowed to proceed.
Section 3.2.2 gives an example of using shared memory.
For efficient cooperation, the shared memory is expected to be a low-latency
memory near each processor core (much like an L1 cache) and __syncthreads()
is expected to be lightweight.
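As an illustration (not taken from the guide), the kernel below stages data in shared memory and uses __syncthreads() so that each thread can safely read an element written by another thread of the same block; it assumes a launch with one-dimensional blocks of 256 threads:

    __global__ void reverseBlock(float* data)
    {
        __shared__ float tile[256];        // one element per thread of the block
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        tile[t] = data[base + t];          // stage this thread's element
        __syncthreads();                   // wait until all elements are written
        data[base + t] = tile[blockDim.x - 1 - t];  // read another thread's element
    }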
[Figure 2-2. Memory Hierarchy: each thread has per-thread local memory, each thread block has per-block shared memory, and all threads of all grids share access to the same global memory.]
The CUDA programming model also assumes that both the host and the device
maintain their own separate memory spaces in DRAM, referred to as host memory and
device memory, respectively. Therefore, a program manages the global, constant, and
texture memory spaces visible to kernels through calls to the CUDA runtime
(described in Chapter 3). This includes device memory allocation and deallocation as
well as data transfer between host and device memory.
[Figure 2-3. Heterogeneous Programming]
Serial code executes on the host while parallel code executes on the device.
Two interfaces are currently supported for writing CUDA programs: CUDA C and the
CUDA driver API. An application typically uses either one or the other, but it can
use both as described in Section 3.4.
CUDA C exposes the CUDA programming model as a minimal set of extensions to
the C language. Any source file that contains some of these extensions must be
compiled with nvcc as outlined in Section 3.1. These extensions allow
programmers to define a kernel as a C function and use some new syntax to specify
the grid and block dimension each time the function is called.
The CUDA driver API is a lower-level C API that provides functions to load
kernels as modules of CUDA binary or assembly code, to inspect their parameters,
and to launch them. Binary and assembly codes are usually obtained by compiling
kernels written in C.
CUDA C comes with a runtime API and both the runtime API and the driver API
provide functions to allocate and deallocate device memory, transfer data between
host memory and device memory, manage systems with multiple devices, etc.
The runtime API is built on top of the CUDA driver API. Initialization, context,
and module management are all implicit and resulting code is more concise.
In contrast, the CUDA driver API requires more code, is harder to program and
debug, but offers a better level of control and is language-independent since it
handles binary or assembly code.
Section 3.2 continues the description of CUDA C started in Chapter 2. It also
introduces concepts that are common to both CUDA C and the driver API: linear
memory, CUDA arrays, shared memory, texture memory, page-locked host
memory, device enumeration, asynchronous execution, interoperability with
graphics APIs. Section 3.3 assumes knowledge of these concepts and describes how
they are exposed by the driver API.
3.2 CUDA C
CUDA C provides a simple path for users familiar with the C programming
language to easily write programs for execution by the device.
It consists of a minimal set of extensions to the C language and a runtime library.
The core language extensions have been introduced in Chapter 2. This section
continues with an introduction to the runtime. A complete description of all
extensions can be found in Appendix B and a complete description of the runtime
in the CUDA reference manual.
The runtime is implemented in the cudart dynamic library and all its entry points
are prefixed with cuda.
There is no explicit initialization function for the runtime; it initializes the first time
a runtime function is called (more specifically any function other than functions
from the device and version management sections of the reference manual). One
needs to keep this in mind when timing runtime function calls and when
interpreting the error code from the first call into the runtime.
Once the runtime has been initialized in a host thread, any resource (memory,
stream, event, etc.) allocated via some runtime function call in the host thread is
only valid within the context of the host thread. Therefore only runtime function
calls made by the host thread (memory copies, kernel launches, …) can operate on
these resources. This is because a CUDA context (see Section 3.3.1) is created under
the hood as part of initialization and made current to the host thread, and it cannot
be made current to any other host thread.
On systems with multiple devices, kernels are executed on device 0 by default as
detailed in Section 3.2.3.
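As a minimal sketch (not part of the guide's samples), an application can force runtime initialization up front so that later timings and error codes are not affected by it:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main()
    {
        // Any call other than device and version management functions
        // initializes the runtime; cudaFree(0) is a common way to trigger it.
        cudaError_t err = cudaFree(0);
        if (err != cudaSuccess)
            printf("Runtime initialization failed: %s\n",
                   cudaGetErrorString(err));
        return 0;
    }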
// Host code
int main()
{
int N = ...;
size_t size = N * sizeof(float);
// Allocate vectors in device memory
float* d_A;
cudaMalloc(&d_A, size);
float* d_B;
cudaMalloc(&d_B, size);
float* d_C;
cudaMalloc(&d_C, size);
// Invoke kernel
int threadsPerBlock = 256;
int blocksPerGrid =
(N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Device code
__global__ void MyKernel(float* devPtr,
size_t pitch, int width, int height)
{
for (int r = 0; r < height; ++r) {
float* row = (float*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
float element = row[c];
}
}
}
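The matching host code is not shown in this excerpt; a sketch of it, allocating the 2D array with cudaMallocPitch() and passing the returned pitch to the kernel above, could look like this:

    // Host code (sketch)
    int width = 64, height = 64;
    float* devPtr;
    size_t pitch;
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
    MyKernel<<<100, 512>>>(devPtr, pitch, width, height);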
The following code sample allocates a width×height×depth 3D array of
floating-point values and shows how to loop over the array elements in device code:
// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);
// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
int width, int height, int depth)
{
char* devPtr = (char*)devPitchedPtr.ptr;
size_t pitch = devPitchedPtr.pitch;
size_t slicePitch = pitch * height;
for (int z = 0; z < depth; ++z) {
char* slice = devPtr + z * slicePitch;
for (int y = 0; y < height; ++y) {
float* row = (float*)(slice + y * pitch);
for (int x = 0; x < width; ++x) {
float element = row[x];
}
}
}
}
The reference manual lists all the various functions used to copy memory between
linear memory allocated with cudaMalloc(), linear memory allocated with
cudaMallocPitch() or cudaMalloc3D(), CUDA arrays, and memory
allocated for variables declared in global or constant memory space.
The following code sample illustrates various ways of accessing global variables via
the runtime API:
__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));
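As a further illustration (a sketch using assumed variable names), cudaGetSymbolAddress() and cudaGetSymbolSize() retrieve the device address and size of a variable declared in global memory space:

    __device__ float devData[256];

    float* devPtr;
    size_t numBytes;
    cudaGetSymbolAddress((void**)&devPtr, devData); // address of devData in device memory
    cudaGetSymbolSize(&numBytes, devData);          // size of devData in bytes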
// Invoke kernel
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
[Figure 3-1. Matrix Multiplication without Shared Memory]
By blocking the computation this way, we take advantage of fast shared memory
and save a lot of global memory bandwidth since A is only read (B.width / block_size)
times from global memory and B is read (A.height / block_size) times.
The Matrix type from the previous code sample is augmented with a stride field, so
that sub-matrices can be efficiently represented with the same type. __device__
functions (see Section B.1.1) are used to get and set elements and build any sub-
matrix from a matrix.
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
int width;
int height;
int stride;
float* elements;
} Matrix;
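A sketch of such __device__ accessor functions (the function names are assumptions; BLOCK_SIZE is the block size used in the surrounding samples) could look like this:

    // Get a matrix element
    __device__ float GetElement(const Matrix A, int row, int col)
    {
        return A.elements[row * A.stride + col];
    }

    // Set a matrix element
    __device__ void SetElement(Matrix A, int row, int col, float value)
    {
        A.elements[row * A.stride + col] = value;
    }

    // Get the BLOCK_SIZE x BLOCK_SIZE sub-matrix of A located col sub-matrices
    // to the right and row sub-matrices down from the upper-left corner of A
    __device__ Matrix GetSubMatrix(Matrix A, int row, int col)
    {
        Matrix Asub;
        Asub.width = BLOCK_SIZE;
        Asub.height = BLOCK_SIZE;
        Asub.stride = A.stride;
        Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                    + BLOCK_SIZE * col];
        return Asub;
    }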
// Invoke kernel
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
[Figure 3-2. Matrix Multiplication with Shared Memory]
When linear filtering is enabled, the texels surrounding a texture fetch location are
read and the return value of the texture fetch is interpolated based on where the texture
coordinates fell between the texels. Simple linear interpolation is performed for one-
dimensional textures and bilinear interpolation is performed for two-dimensional
textures.
Appendix F gives more details on texture fetching.
float u = x / (float)width;
float v = y / (float)height;
// Transform coordinates
u -= 0.5f;
v -= 0.5f;
float tu = u * cosf(theta) - v * sinf(theta) + 0.5f;
float tv = v * cosf(theta) + u * sinf(theta) + 0.5f;
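For context, a sketch of the complete kernel around the coordinate code above (the texture reference name and the kernel signature are inferred from the host code below):

    // 2D float texture
    texture<float, 2, cudaReadModeElementType> texRef;

    // Simple transformation kernel (sketch)
    __global__ void transformKernel(float* output,
                                    int width, int height, float theta)
    {
        // Calculate normalized texture coordinates
        unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
        float u = x / (float)width;
        float v = y / (float)height;
        // Transform coordinates
        u -= 0.5f;
        v -= 0.5f;
        float tu = u * cosf(theta) - v * sinf(theta) + 0.5f;
        float tv = v * cosf(theta) + u * sinf(theta) + 0.5f;
        // Read from the texture and write to global memory
        output[y * width + x] = tex2D(texRef, tu, tv);
    }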
// Host code
int main()
{
// Allocate CUDA array in device memory
cudaChannelFormatDesc channelDesc =
cudaCreateChannelDesc(32, 0, 0, 0,
cudaChannelFormatKindFloat);
cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);
// Invoke kernel
dim3 dimBlock(16, 16);
dim3 dimGrid((width + dimBlock.x - 1) / dimBlock.x,
(height + dimBlock.y - 1) / dimBlock.y);
transformKernel<<<dimGrid, dimBlock>>>(output, width, height,
angle);
Unlike texture memory, surface memory uses byte addressing. This means that the
x-coordinate used to access a texture element via texture functions needs to be
multiplied by the byte size of the element to access the same element via a surface
function. For example, the element at texture coordinate x of a one-dimensional
floating-point CUDA array bound to a texture reference texRef and a surface
reference surfRef is read using tex1D(texRef, x) via texRef, but
surf1Dread(surfRef, 4*x) via surfRef. Similarly, the element at texture
coordinate x and y of a two-dimensional floating-point CUDA array bound to a
texture reference texRef and a surface reference surfRef is accessed using
tex2D(texRef, x, y) via texRef, but surf2Dread(surfRef, 4*x, y)
via surfRef (the byte offset of the y-coordinate is internally calculated from the
underlying line pitch of the CUDA array).
The following code sample uses surface references to copy the content of one 2D CUDA array to another:
// 2D surfaces
surface<void, 2> inputSurfRef;
surface<void, 2> outputSurfRef;
// Host code
int main()
{
// Allocate CUDA arrays in device memory
cudaChannelFormatDesc channelDesc =
cudaCreateChannelDesc(8, 8, 8, 8,
cudaChannelFormatKindUnsigned);
cudaArray* cuInputArray;
cudaMallocArray(&cuInputArray, &channelDesc, width, height,
cudaArraySurfaceLoadStore);
cudaArray* cuOutputArray;
cudaMallocArray(&cuOutputArray, &channelDesc, width, height,
cudaArraySurfaceLoadStore);
// Invoke kernel
dim3 dimBlock(16, 16);
dim3 dimGrid((width + dimBlock.x - 1) / dimBlock.x,
(height + dimBlock.y - 1) / dimBlock.y);
copyKernel<<<dimGrid, dimBlock>>>(width, height);
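The device-side kernel launched by this host code is not shown in the excerpt; a sketch of a simple surface-to-surface copy kernel, using the byte-addressed x coordinate described above, could look like this:

    // Simple copy kernel (sketch)
    __global__ void copyKernel(int width, int height)
    {
        // Calculate surface coordinates
        unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            uchar4 data;
            // Read from the input surface (x is multiplied by the element size)
            surf2Dread(&data, inputSurfRef, x * 4, y);
            // Write to the output surface
            surf2Dwrite(data, outputSurfRef, x * 4, y);
        }
    }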
In addition, by reducing the amount of physical memory available to the operating
system for paging, allocating too much page-locked memory reduces overall system
performance.
The simple zero-copy SDK sample comes with a detailed document on the page-
locked memory APIs.
To be able to retrieve the device pointer to any mapped page-locked memory within
a given host thread, page-locked memory mapping must be enabled by calling
cudaSetDeviceFlags() with the cudaDeviceMapHost flag before any other
CUDA call is performed by the thread. Otherwise,
cudaHostGetDevicePointer() will return an error.
cudaHostGetDevicePointer() also returns an error if the device does not
support mapped page-locked host memory.
Applications may query whether a device supports mapped page-locked host
memory or not by calling cudaGetDeviceProperties() and checking the
canMapHostMemory property.
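A minimal sketch of this sequence (variable names are assumptions) is:

    // Enable mapping of page-locked memory before any other CUDA call
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate mapped page-locked host memory
    size_t size = 1024 * sizeof(float);
    float* hostPtr;
    cudaHostAlloc((void**)&hostPtr, size, cudaHostAllocMapped);

    // Retrieve the corresponding device pointer; kernels can use it to
    // access the host memory directly
    float* devPtr;
    cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);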
Note that atomic functions (Section B.11) operating on mapped page-locked
memory are not atomic from the point of view of the host or other devices.
A kernel from one CUDA context cannot execute concurrently with a kernel from
another CUDA context.
Kernels that use many textures or a large amount of local memory are less likely to
execute concurrently with other kernels.
3.2.6.5 Stream
Applications manage concurrency through streams. A stream is a sequence of
commands that execute in order. Different streams, on the other hand, may execute
their commands out of order with respect to one another or concurrently; this
behavior is not guaranteed and should therefore not be relied upon for correctness
(e.g. inter-kernel communication is undefined).
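A minimal sketch of stream management (the work queued into each stream is elided) is:

    // Create two streams
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    // ... queue asynchronous copies and kernel launches into stream[0] and
    // stream[1]; commands within a stream execute in order, commands in
    // different streams may overlap ...

    // Wait for each stream to finish, then release it
    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
    }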
3.2.6.6 Event
The runtime also provides a way to closely monitor the device's progress, as well as
perform accurate timing, by letting the application asynchronously record events at
any point in the program and query when these events are completed. An event has
completed when all tasks – or optionally, all commands in a given stream –
preceding the event have completed. Events in stream zero are completed after all
preceding tasks and commands in all streams are completed.
The following code sample creates two events:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
These events can be used to time the code sample of the previous section the
following way:
cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i) {
cudaMemcpyAsync(inputDev + i * size, inputHost + i * size,
size, cudaMemcpyHostToDevice, stream[i]);
MyKernel<<<100, 512, 0, stream[i]>>>
(outputDev + i * size, inputDev + i * size, size);
cudaMemcpyAsync(outputHost + i * size, outputDev + i * size,
size, cudaMemcpyDeviceToHost, stream[i]);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
They are destroyed this way:
cudaEventDestroy(start);
cudaEventDestroy(stop);
int main()
{
// Explicitly set device
cudaGLSetGLDevice(0);
void display()
{
// Map buffer object for writing from CUDA
float4* positions;
cudaGraphicsMapResources(1, &positionsVBO_CUDA, 0);
size_t num_bytes;
cudaGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVBO_CUDA);
// Execute kernel
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
createVertices<<<dimGrid, dimBlock>>>(positions, time,
width, height);
// Swap buffers
glutSwapBuffers();
glutPostRedisplay();
}
void deleteVBO()
{
cudaGraphicsUnregisterResource(positionsVBO_CUDA);
glDeleteBuffers(1, &positionsVBO);
}
// Calculate uv coordinates
float u = x / (float)width;
float v = y / (float)height;
u = u * 2.0f - 1.0f;
v = v * 2.0f - 1.0f;
// Calculate simple sine wave pattern
float freq = 4.0f;
float w = sinf(u * freq + time) * cosf(v * freq + time);
// Write positions
positions[y * width + x] = make_float4(u, w, v, 1.0f);
}
On Windows and for Quadro GPUs, cudaWGLGetDevice() can be used to
retrieve the CUDA device associated to the handle returned by
wglEnumGpusNV(). Quadro GPUs offer higher performance OpenGL
interoperability than GeForce and Tesla GPUs in a multi-GPU configuration where
OpenGL rendering is performed on the Quadro GPU and CUDA computations are
performed on other GPUs in the system.
IDirect3DDevice9* device;
struct CUSTOMVERTEX {
FLOAT x, y, z;
DWORD color;
};
IDirect3DVertexBuffer9* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;
int main()
{
// Initialize Direct3D
D3D = Direct3DCreate9(D3D_SDK_VERSION);
// Create device
...
D3D->CreateDevice(adapter, D3DDEVTYPE_HAL, hWnd,
D3DCREATE_HARDWARE_VERTEXPROCESSING,
&params, &device);
void Render()
{
// Map vertex buffer for writing from CUDA
float4* positions;
cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
size_t num_bytes;
cudaGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVB_CUDA);
// Execute kernel
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
createVertices<<<dimGrid, dimBlock>>>(positions, time,
width, height);
void releaseVB()
{
cudaGraphicsUnregisterResource(positionsVB_CUDA);
positionsVB->Release();
}
// Calculate uv coordinates
float u = x / (float)width;
float v = y / (float)height;
u = u * 2.0f - 1.0f;
v = v * 2.0f - 1.0f;
// Calculate simple sine wave pattern
float freq = 4.0f;
float w = sinf(u * freq + time) * cosf(v * freq + time);
// Write positions
positions[y * width + x] =
make_float4(u, w, v, __int_as_float(0xff00ff00));
}
Direct3D 10 Version:
ID3D10Device* device;
struct CUSTOMVERTEX {
FLOAT x, y, z;
DWORD color;
};
ID3D10Buffer* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;
int main()
{
// Get a CUDA-enabled adapter
IDXGIFactory* factory;
CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory);
IDXGIAdapter* adapter = 0;
for (unsigned int i = 0; !adapter; ++i) {
if (FAILED(factory->EnumAdapters(i, &adapter)))
break;
int dev;
if (cudaD3D10GetDevice(&dev, adapter) == cudaSuccess)
break;
adapter->Release();
}
factory->Release();
void Render()
{
// Map vertex buffer for writing from CUDA
float4* positions;
cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
size_t num_bytes;
cudaGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVB_CUDA);
// Execute kernel
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
createVertices<<<dimGrid, dimBlock>>>(positions, time,
width, height);
void releaseVB()
{
cudaGraphicsUnregisterResource(positionsVB_CUDA);
positionsVB->Release();
}
// Calculate uv coordinates
float u = x / (float)width;
float v = y / (float)height;
u = u * 2.0f - 1.0f;
v = v * 2.0f - 1.0f;
// Calculate simple sine wave pattern
float freq = 4.0f;
float w = sinf(u * freq + time) * cosf(v * freq + time);
// Write positions
positions[y * width + x] =
make_float4(u, w, v, __int_as_float(0xff00ff00));
}
Direct3D 11 Version:
ID3D11Device* device;
struct CUSTOMVERTEX {
FLOAT x, y, z;
DWORD color;
};
ID3D11Buffer* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;
int main()
{
// Get a CUDA-enabled adapter
IDXGIFactory* factory;
CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory);
IDXGIAdapter* adapter = 0;
for (unsigned int i = 0; !adapter; ++i) {
if (FAILED(factory->EnumAdapters(i, &adapter)))
break;
int dev;
if (cudaD3D11GetDevice(&dev, adapter) == cudaSuccess)
break;
adapter->Release();
}
factory->Release();
void Render()
{
// Map vertex buffer for writing from CUDA
float4* positions;
cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
size_t num_bytes;
cudaGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVB_CUDA);
// Execute kernel
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
createVertices<<<dimGrid, dimBlock>>>(positions, time,
width, height);
void releaseVB()
{
cudaGraphicsUnregisterResource(positionsVB_CUDA);
positionsVB->Release();
}
// Calculate uv coordinates
float u = x / (float)width;
float v = y / (float)height;
u = u * 2.0f - 1.0f;
v = v * 2.0f - 1.0f;
// Calculate simple sine wave pattern
float freq = 4.0f;
float w = sinf(u * freq + time) * cosf(v * freq + time);
// Write positions
positions[y * width + x] =
make_float4(u, w, v, __int_as_float(0xff00ff00));
}
The only way to check for asynchronous errors just after some asynchronous
function call is therefore to synchronize just after the call by calling
cudaThreadSynchronize() (or by using any other synchronization
mechanisms described in Section 3.2.6) and checking the error code returned by
cudaThreadSynchronize().
The runtime maintains an error variable for each host thread that is initialized to
cudaSuccess and is overwritten by the error code every time an error occurs (be
it a parameter validation error or an asynchronous error).
cudaPeekAtLastError() returns this variable. cudaGetLastError() returns
this variable and resets it to cudaSuccess.
Kernel launches do not return any error code, so cudaPeekAtLastError() or
cudaGetLastError() must be called just after the kernel launch to retrieve any
pre-launch errors. To ensure that any error returned by
cudaPeekAtLastError() or cudaGetLastError() does not originate from
calls prior to the kernel launch, one has to make sure that the runtime error variable
is set to cudaSuccess just before the kernel launch, for example, by calling
cudaGetLastError() just before the kernel launch. Kernel launches are
asynchronous, so to check for asynchronous errors, the application must
synchronize in-between the kernel launch and the call to
cudaPeekAtLastError() or cudaGetLastError().
Note that cudaErrorNotReady that may be returned by cudaStreamQuery()
and cudaEventQuery() is not considered an error and is therefore not reported
by cudaPeekAtLastError() or cudaGetLastError().
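Putting these rules together, a sketch of checking both pre-launch and asynchronous errors around a kernel launch (MyKernel is a placeholder) could look like this:

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void MyKernel(void) { }

    int main()
    {
        cudaGetLastError();                              // reset the error variable
        MyKernel<<<1, 1>>>();
        cudaError_t launchErr = cudaGetLastError();      // pre-launch errors
        cudaError_t asyncErr = cudaThreadSynchronize();  // asynchronous errors
        if (launchErr != cudaSuccess)
            printf("Launch failed: %s\n", cudaGetErrorString(launchErr));
        else if (asyncErr != cudaSuccess)
            printf("Kernel failed: %s\n", cudaGetErrorString(asyncErr));
        return 0;
    }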
The driver API is implemented in the nvcuda dynamic library and all its entry
points are prefixed with cu.
The driver API must be initialized with cuInit() before any function from the
driver API is called. A CUDA context must then be created that is attached to a
specific device and made current to the calling host thread as detailed in
Section 3.3.1.
Within a CUDA context, kernels are explicitly loaded as PTX or binary objects by
the host code as described in Section 3.3.2. Kernels written in C must therefore be
compiled separately into PTX or binary objects. Kernels are launched using API
entry points as described in Section 3.3.3.
Any application that wants to run on future device architectures must load PTX, not
binary code. This is because binary code is architecture-specific and therefore
incompatible with future architectures, whereas PTX code is compiled to binary
code at load time by the driver.
Here is the host code of the sample from Section 2.1 written using the driver API:
int main()
{
int N = ...;
size_t size = N * sizeof(float);
// Initialize
cuInit(0);
// Create context
CUcontext cuContext;
cuCtxCreate(&cuContext, 0, cuDevice);
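// (Sketch of steps elided in this excerpt: cuDevice was obtained earlier
// with cuDeviceGet(). Load the module containing the kernel and get a
// handle to the kernel function; the .ptx file name is an assumption.)
CUmodule cuModule;
cuModuleLoad(&cuModule, "VecAdd.ptx");
CUfunction vecAdd;
cuModuleGetFunction(&vecAdd, cuModule, "VecAdd");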
// Invoke kernel
#define ALIGN_UP(offset, alignment) \
(offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)
int offset = 0;
ALIGN_UP(offset, __alignof(d_A));
cuParamSetv(vecAdd, offset, &d_A, sizeof(d_A));
offset += sizeof(d_A);
ALIGN_UP(offset, __alignof(d_B));
cuParamSetv(vecAdd, offset, &d_B, sizeof(d_B));
offset += sizeof(d_B);
ALIGN_UP(offset, __alignof(d_C));
cuParamSetv(vecAdd, offset, &d_C, sizeof(d_C));
offset += sizeof(d_C);
ALIGN_UP(offset, __alignof(N));
cuParamSeti(vecAdd, offset, N);
offset += sizeof(N);
cuParamSetSize(vecAdd, offset);
int threadsPerBlock = 256;
int blocksPerGrid =
(N + threadsPerBlock - 1) / threadsPerBlock;
cuFuncSetBlockShape(vecAdd, threadsPerBlock, 1, 1);
cuLaunchGrid(vecAdd, blocksPerGrid, 1);
...
}
Full code can be found in the vectorAddDrv SDK code sample.
3.3.1 Context
A CUDA context is analogous to a CPU process. All resources and actions
performed within the driver API are encapsulated inside a CUDA context, and the
system automatically cleans up these resources when the context is destroyed.
Besides objects such as modules and texture or surface references, each context has
its own distinct 32-bit address space. As a result, CUdeviceptr values from
different contexts reference different memory locations.
A host thread may have only one device context current at a time. When a context is
created with cuCtxCreate(), it is made current to the calling host thread. CUDA
functions that operate in a context (most functions that do not involve device
enumeration or context management) will return
CUDA_ERROR_INVALID_CONTEXT if a valid context is not current to the thread.
Each host thread has a stack of current contexts. cuCtxCreate() pushes the new
context onto the top of the stack. cuCtxPopCurrent() may be called to detach
the context from the host thread. The context is then "floating" and may be pushed
as the current context for any host thread. cuCtxPopCurrent() also restores the
previous current context, if any.
A usage count is also maintained for each context. cuCtxCreate() creates a
context with a usage count of 1. cuCtxAttach() increments the usage count and
cuCtxDetach() decrements it. A context is destroyed when the usage count goes
to 0 when calling cuCtxDetach() or cuCtxDestroy().
Usage count facilitates interoperability between third party authored code operating
in the same context. For example, if three libraries are loaded to use the same
context, each library would call cuCtxAttach() to increment the usage count and
cuCtxDetach() to decrement the usage count when the library is done using the
context. For most libraries, it is expected that the application will have created a
context before loading or initializing the library; that way, the application can create
the context using its own heuristics, and the library simply operates on the context
handed to it. Libraries that wish to create their own contexts – unbeknownst to their
API clients who may or may not have created contexts of their own – would use
cuCtxPushCurrent() and cuCtxPopCurrent() as illustrated in Figure 3-3.
[Figure 3-3. Library Context Management]
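As an illustrative sketch (not from the guide), a library entry point that runs on its own context without disturbing its caller's current context could be structured as follows:

    void libraryCall(CUcontext libCtx)
    {
        cuCtxPushCurrent(libCtx);   // make the library's context current
        // ... perform CUDA work against libCtx ...
        CUcontext popped;
        cuCtxPopCurrent(&popped);   // detach it again; the caller's previous
                                    // context, if any, becomes current again
    }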
3.3.2 Module
Modules are dynamically loadable packages of device code and data, akin to DLLs in
Windows, that are output by nvcc (see Section 3.1). The names for all symbols,
including functions, global variables, and texture or surface references, are
maintained at module scope so that modules written by independent third parties can
interoperate in the same CUDA context.
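For instance, a sketch of loading a module from a PTX image held in memory, letting the driver JIT-compile it and capture its error log (the ptxImage argument is an assumption):

    CUmodule loadModule(const char* ptxImage)
    {
        CUmodule cuModule;
        char errorLog[8192];
        CUjit_option options[] = { CU_JIT_ERROR_LOG_BUFFER,
                                   CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES };
        void* values[] = { errorLog, (void*)(size_t)sizeof(errorLog) };
        // JIT-compile the PTX to binary code for the current device at load time
        cuModuleLoadDataEx(&cuModule, ptxImage, 2, options, values);
        return cuModule;
    }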
int i;
ALIGN_UP(offset, __alignof(i));
cuParamSeti(cuFunction, offset, i);
offset += sizeof(i);
float4 f4;
ALIGN_UP(offset, 16); // float4's alignment is 16
cuParamSetv(cuFunction, offset, &f4, sizeof(f4));
offset += sizeof(f4);
char c;
ALIGN_UP(offset, __alignof(c));
cuParamSeti(cuFunction, offset, c);
offset += sizeof(c);
float f;
ALIGN_UP(offset, __alignof(f));
cuParamSetf(cuFunction, offset, f);
offset += sizeof(f);
CUdeviceptr dptr;
ALIGN_UP(offset, __alignof(dptr));
cuParamSetv(cuFunction, offset, &dptr, sizeof(dptr));
offset += sizeof(dptr);
float2 f2;
ALIGN_UP(offset, 8); // float2's alignment is 8
cuParamSetv(cuFunction, offset, &f2, sizeof(f2));
offset += sizeof(f2);
cuParamSetSize(cuFunction, offset);
// Create context
CUcontext cuContext;
cuCtxCreate(&cuContext, 0, cuDevice);
// Invoke kernel
#define ALIGN_UP(offset, alignment) \
(offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)
int offset = 0;
ALIGN_UP(offset, __alignof(d_A));
cuParamSetv(vecAdd, offset, &d_A, sizeof(d_A));
offset += sizeof(d_A);
ALIGN_UP(offset, __alignof(d_B));
cuParamSetv(vecAdd, offset, &d_B, sizeof(d_B));
offset += sizeof(d_B);
ALIGN_UP(offset, __alignof(d_C));
cuParamSetv(vecAdd, offset, &d_C, sizeof(d_C));
offset += sizeof(d_C);
cuParamSetSize(vecAdd, offset);
int threadsPerBlock = 256;
int blocksPerGrid =
(N + threadsPerBlock - 1) / threadsPerBlock;
cuFuncSetBlockShape(vecAdd, threadsPerBlock, 1, 1);
cuLaunchGrid(vecAdd, blocksPerGrid, 1);
// Device code
__global__ void MyKernel(float* devPtr,
size_t pitch, int width, int height)
{
for (int r = 0; r < height; ++r) {
float* row = (float*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
float element = row[c];
}
}
}
The following code sample allocates a width×height CUDA array of one 32-bit
floating-point component:
CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_FLOAT;
desc.NumChannels = 1;
desc.Width = width;
desc.Height = height;
CUarray cuArray;
cuArrayCreate(&cuArray, &desc);
The reference manual lists all the various functions used to copy memory between
linear memory allocated with cuMemAlloc(), linear memory allocated with
cuMemAllocPitch(), and CUDA arrays.
The following code sample copies the 2D array to the CUDA array allocated in the
previous code samples:
CUDA_MEMCPY2D copyParam;
memset(&copyParam, 0, sizeof(copyParam));
copyParam.dstMemoryType = CU_MEMORYTYPE_ARRAY;
copyParam.dstArray = cuArray;
copyParam.srcMemoryType = CU_MEMORYTYPE_DEVICE;
copyParam.srcDevice = devPtr;
copyParam.srcPitch = pitch;
copyParam.WidthInBytes = width * sizeof(float);
copyParam.Height = height;
cuMemcpy2D(&copyParam);
The following code sample illustrates various ways of accessing global variables via
the driver API:
CUdeviceptr devPtr;
size_t bytes;
Matrix d_B;
d_B.width = d_B.stride = B.width; d_B.height = B.height;
size = B.width * B.height * sizeof(float);
CUdeviceptr elements;
cuMemAlloc(&elements, size);
cuMemcpyHtoD(elements, B.elements, size);
d_B.elements = (float*)elements;
The following code sample is the driver version of the host code of the sample from
Section 3.2.4.1.3.
// Host code
int main()
{
// Allocate CUDA array in device memory
CUarray cuArray;
CUDA_ARRAY_DESCRIPTOR desc;
desc.Format = CU_AD_FORMAT_FLOAT;
desc.NumChannels = 1;
desc.Width = width;
desc.Height = height;
cuArrayCreate(&cuArray, &desc);
ALIGN_UP(offset, __alignof(height));
cuParamSeti(transformKernel, offset, height);
offset += sizeof(height);
ALIGN_UP(offset, __alignof(angle));
cuParamSetf(transformKernel, offset, angle);
offset += sizeof(angle);
cuParamSetSize(transformKernel, offset);
cuFuncSetBlockShape(transformKernel, 16, 16, 1);
cuLaunchGrid(transformKernel,
(width + 16 - 1) / 16,
(height + 16 - 1) / 16);
cuMemcpy2D(&copyParam);
3.3.9.1 Stream
The driver API provides functions similar to the runtime API to manage streams.
The following code sample is the driver version of the code sample from
Section 3.2.6.4.
CUstream stream[2];
for (int i = 0; i < 2; ++i)
cuStreamCreate(&stream[i], 0);
float* hostPtr;
cuMemAllocHost((void**)&hostPtr, 2 * size);
cuEventRecord(start, 0);
for (int i = 0; i < 2; ++i)
cuMemcpyHtoDAsync(inputDevPtr + i * size, hostPtr + i * size,
size, stream[i]);
for (int i = 0; i < 2; ++i) {
#define ALIGN_UP(offset, alignment) \
(offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)
int offset = 0;
ALIGN_UP(offset, __alignof(outputDevPtr));
cuParamSetv(cuFunction, offset,
&outputDevPtr, sizeof(outputDevPtr));
offset += sizeof(outputDevPtr);
ALIGN_UP(offset, __alignof(inputDevPtr));
cuParamSetv(cuFunction, offset,
&inputDevPtr, sizeof(inputDevPtr));
offset += sizeof(inputDevPtr);
ALIGN_UP(offset, __alignof(size));
cuParamSeti(cuFunction, offset, size);
offset += sizeof(size);
cuParamSetSize(cuFunction, offset);
cuFuncSetBlockShape(cuFunction, 512, 1, 1);
cuLaunchGridAsync(cuFunction, 100, 1, stream[i]);
}
for (int i = 0; i < 2; ++i)
cuMemcpyDtoHAsync(hostPtr + i * size, outputDevPtr + i * size,
size, stream[i]);
cuEventRecord(stop, 0);
cuEventSynchronize(stop);
float elapsedTime;
cuEventElapsedTime(&elapsedTime, start, stop);
They are destroyed this way:
cuEventDestroy(start);
cuEventDestroy(stop);
int main()
{
// Initialize driver API
...
// Create context
CUcontext cuContext;
cuGLCtxCreate(&cuContext, 0, cuDevice);
void display()
{
// Map OpenGL buffer object for writing from CUDA
CUdeviceptr positions;
cuGraphicsMapResources(1, &positionsVBO_CUDA, 0);
size_t num_bytes;
cuGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVBO_CUDA);
// Execute kernel
#define ALIGN_UP(offset, alignment) \
(offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)
int offset = 0;
ALIGN_UP(offset, __alignof(positions));
cuParamSetv(createVertices, offset,
&positions, sizeof(positions));
offset += sizeof(positions);
ALIGN_UP(offset, __alignof(time));
cuParamSetf(createVertices, offset, time);
offset += sizeof(time);
ALIGN_UP(offset, __alignof(width));
cuParamSeti(createVertices, offset, width);
offset += sizeof(width);
ALIGN_UP(offset, __alignof(height));
cuParamSeti(createVertices, offset, height);
offset += sizeof(height);
cuParamSetSize(createVertices, offset);
int threadsPerBlock = 16;
cuFuncSetBlockShape(createVertices,
threadsPerBlock, threadsPerBlock, 1);
cuLaunchGrid(createVertices,
width / threadsPerBlock, height / threadsPerBlock);
// Swap buffers
glutSwapBuffers();
glutPostRedisplay();
}
void deleteVBO()
{
cuGraphicsUnregisterResource(positionsVBO_CUDA);
glDeleteBuffers(1, &positionsVBO);
}
On Windows and for Quadro GPUs, cuWGLGetDevice() can be used to retrieve
the CUDA device associated to the handle returned by wglEnumGpusNV().
The following code sample is the driver version of the host code of the sample from
Section 3.2.7.2.
Direct3D 9 Version:
IDirect3D9* D3D;
IDirect3DDevice9* device;
struct CUSTOMVERTEX {
FLOAT x, y, z;
DWORD color;
};
IDirect3DVertexBuffer9* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;
int main()
{
// Initialize Direct3D
D3D = Direct3DCreate9(D3D_SDK_VERSION);
// Create device
...
D3D->CreateDevice(adapter, D3DDEVTYPE_HAL, hWnd,
D3DCREATE_HARDWARE_VERTEXPROCESSING,
&params, &device);
// Create context
CUdevice cuDevice;
CUcontext cuContext;
cuD3D9CtxCreate(&cuContext, &cuDevice, 0, device);
cuGraphicsD3D9RegisterResource(&positionsVB_CUDA,
positionsVB,
CU_GRAPHICS_REGISTER_FLAGS_NONE);
cuGraphicsResourceSetMapFlags(positionsVB_CUDA,
CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD);
void Render()
{
// Map vertex buffer for writing from CUDA
float4* positions;
cuGraphicsMapResources(1, &positionsVB_CUDA, 0);
size_t num_bytes;
cuGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVB_CUDA);
// Execute kernel
#define ALIGN_UP(offset, alignment) \
(offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)
int offset = 0;
ALIGN_UP(offset, __alignof(positions));
cuParamSetv(createVertices, offset,
&positions, sizeof(positions));
offset += sizeof(positions);
ALIGN_UP(offset, __alignof(time));
cuParamSetf(createVertices, offset, time);
offset += sizeof(time);
ALIGN_UP(offset, __alignof(width));
cuParamSeti(createVertices, offset, width);
offset += sizeof(width);
ALIGN_UP(offset, __alignof(height));
cuParamSeti(createVertices, offset, height);
offset += sizeof(height);
cuParamSetSize(createVertices, offset);
int threadsPerBlock = 16;
cuFuncSetBlockShape(createVertices,
threadsPerBlock, threadsPerBlock, 1);
cuLaunchGrid(createVertices,
width / threadsPerBlock, height / threadsPerBlock);
void releaseVB()
{
cuGraphicsUnregisterResource(positionsVB_CUDA);
positionsVB->Release();
}
Direct3D 10 Version:
ID3D10Device* device;
struct CUSTOMVERTEX {
FLOAT x, y, z;
DWORD color;
};
ID3D10Buffer* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;
int main()
{
// Get a CUDA-enabled adapter
IDXGIFactory* factory;
CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory);
IDXGIAdapter* adapter = 0;
for (unsigned int i = 0; !adapter; ++i) {
if (FAILED(factory->EnumAdapters(i, &adapter)))
break;
int dev;
if (cuD3D10GetDevice(&dev, adapter) == CUDA_SUCCESS)
break;
adapter->Release();
}
factory->Release();
// Create context
CUdevice cuDevice;
CUcontext cuContext;
cuD3D10CtxCreate(&cuContext, &cuDevice, 0, device);
bufferDesc.Usage = D3D10_USAGE_DEFAULT;
bufferDesc.ByteWidth = size;
bufferDesc.BindFlags = D3D10_BIND_VERTEX_BUFFER;
bufferDesc.CPUAccessFlags = 0;
bufferDesc.MiscFlags = 0;
device->CreateBuffer(&bufferDesc, 0, &positionsVB);
cuGraphicsD3D10RegisterResource(&positionsVB_CUDA,
positionsVB,
CU_GRAPHICS_REGISTER_FLAGS_NONE);
cuGraphicsResourceSetMapFlags(positionsVB_CUDA,
CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD);
void Render()
{
// Map vertex buffer for writing from CUDA
float4* positions;
cuGraphicsMapResources(1, &positionsVB_CUDA, 0);
size_t num_bytes;
cuGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVB_CUDA);
// Execute kernel
#define ALIGN_UP(offset, alignment) \
(offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)
int offset = 0;
ALIGN_UP(offset, __alignof(positions));
cuParamSetv(createVertices, offset,
&positions, sizeof(positions));
offset += sizeof(positions);
ALIGN_UP(offset, __alignof(time));
cuParamSetf(createVertices, offset, time);
offset += sizeof(time);
ALIGN_UP(offset, __alignof(width));
cuParamSeti(createVertices, offset, width);
offset += sizeof(width);
ALIGN_UP(offset, __alignof(height));
cuParamSeti(createVertices, offset, height);
offset += sizeof(height);
cuParamSetSize(createVertices, offset);
int threadsPerBlock = 16;
cuFuncSetBlockShape(createVertices,
threadsPerBlock, threadsPerBlock, 1);
cuLaunchGrid(createVertices,
width / threadsPerBlock, height / threadsPerBlock);
void releaseVB()
{
cuGraphicsUnregisterResource(positionsVB_CUDA);
positionsVB->Release();
}
Direct3D 11 Version:
ID3D11Device* device;
struct CUSTOMVERTEX {
FLOAT x, y, z;
DWORD color;
};
ID3D11Buffer* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;
int main()
{
// Get a CUDA-enabled adapter
IDXGIFactory* factory;
CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory);
IDXGIAdapter* adapter = 0;
for (unsigned int i = 0; !adapter; ++i) {
if (FAILED(factory->EnumAdapters(i, &adapter)))
break;
int dev;
if (cuD3D11GetDevice(&dev, adapter) == CUDA_SUCCESS)
break;
adapter->Release();
}
factory->Release();
// Create context
CUdevice cuDevice;
CUcontext cuContext;
cuD3D11CtxCreate(&cuContext, &cuDevice, 0, device);
void Render()
{
// Map vertex buffer for writing from CUDA
float4* positions;
cuGraphicsMapResources(1, &positionsVB_CUDA, 0);
size_t num_bytes;
cuGraphicsResourceGetMappedPointer((void**)&positions,
&num_bytes,
positionsVB_CUDA);
// Execute kernel
#define ALIGN_UP(offset, alignment) \
(offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)
int offset = 0;
ALIGN_UP(offset, __alignof(positions));
cuParamSetv(createVertices, offset,
&positions, sizeof(positions));
offset += sizeof(positions);
ALIGN_UP(offset, __alignof(time));
cuParamSetf(createVertices, offset, time);
offset += sizeof(time);
ALIGN_UP(offset, __alignof(width));
cuParamSeti(createVertices, offset, width);
offset += sizeof(width);
ALIGN_UP(offset, __alignof(height));
cuParamSeti(createVertices, offset, height);
offset += sizeof(height);
cuParamSetSize(createVertices, offset);
int threadsPerBlock = 16;
cuFuncSetBlockShape(createVertices,
threadsPerBlock, threadsPerBlock, 1);
cuLaunchGrid(createVertices,
width / threadsPerBlock, height / threadsPerBlock);
void releaseVB()
{
cuGraphicsUnregisterResource(positionsVB_CUDA);
positionsVB->Release();
}
When users initiate a mode switch of the display by changing the resolution or bit depth of the display
(using NVIDIA control panel or the Display control panel on Windows), the
amount of memory needed for the primary surface changes. For example, if the user
changes the display resolution from 1280x1024x32-bit to 1600x1200x32-bit, the
system must dedicate 7.68 MB to the primary surface rather than 5.24 MB. (Full-
screen graphics applications running with anti-aliasing enabled may require much
more display memory for the primary surface.) On Windows, other events that may
initiate display mode switches include launching a full-screen DirectX application,
hitting Alt+Tab to task switch away from a full-screen DirectX application, or
hitting Ctrl+Alt+Del to lock the computer.
If a mode switch increases the amount of memory needed for the primary surface,
the system may have to cannibalize memory allocations dedicated to CUDA
applications. Therefore, a mode switch can cause any call to the CUDA runtime to
fail and return an invalid context error.
These limits are a function of the compute capability of the device and are given in Appendix G.
If there are not enough registers or shared memory available per multiprocessor to
process at least one block, the kernel will fail to launch.
The total number of warps Wblock in a block is as follows:
Wblock = ceil(T / Wsize, 1)
where T is the number of threads per block, Wsize is the warp size (equal to 32), and
ceil(x, y) is x rounded up to the nearest multiple of y.
Either the threads belong to the same block, in which case they should use
__syncthreads() and share data through shared memory within the same kernel
invocation, or they belong to different blocks, in which case they must share data
through global memory using two separate kernel invocations, one
for writing to and one for reading from global memory. The second case is much
less optimal since it adds the overhead of extra kernel invocations and global
memory traffic. Its occurrence should therefore be minimized by mapping the
algorithm to the CUDA programming model in such a way that the computations
that require inter-thread communication are performed within a single thread block
as much as possible.
For devices of compute capability 2.0, the two instructions issued every other cycle
are for two different warps. For devices of compute capability 2.1, the four
instructions issued every other cycle are two pairs for two different warps, each pair
being for the same warp.
The most common reason a warp is not ready to execute its next instruction is that
the instruction's input operands are not yet available.
If all input operands are registers, latency is caused by register dependencies, i.e.
some of the input operands are written by some previous instruction(s) whose
execution has not completed yet. In the case of a back-to-back register dependency
(i.e. some input operand is written by the previous instruction), the latency is equal
to the execution time of the previous instruction and the warp scheduler must
schedule instructions for different warps during that time. Execution time varies
depending on the instruction, but it is typically about 22 clock cycles, which
translates to 6 warps for devices of compute capability 1.x and 22 warps for devices
of compute capability 2.x.
If some input operand resides in off-chip memory, the latency is much higher: 400
to 800 clock cycles. The number of warps required to keep the warp schedulers busy
during such high latency periods depends on the kernel code; in general, more warps
are required if the ratio of the number of instructions with no off-chip memory
operands (i.e. arithmetic instructions most of the time) to the number of
instructions with off-chip memory operands is low (this ratio is commonly called
the arithmetic intensity of the program). If this ratio is 15, for example, then to hide
latencies of about 600 clock cycles, about 10 warps are required for devices of
compute capability 1.x and about 40 for devices of compute capability 2.x.
Another reason a warp is not ready to execute its next instruction is that it is waiting
at some memory fence (Section B.5) or synchronization point (Section B.6). A
synchronization point can force the multiprocessor to idle as more and more warps
wait for other warps in the same block to complete execution of instructions prior
to the synchronization point. Having multiple resident blocks per multiprocessor
can help reduce idling in this case, as warps from different blocks do not need to
wait for each other at synchronization points.
The number of blocks and warps residing on each multiprocessor for a given kernel
call depends on the execution configuration of the call (Section B.16), the memory
resources of the multiprocessor, and the resource requirements of the kernel as
described in Section 4.2. To assist programmers in choosing thread block size based
on register and shared memory requirements, the CUDA Software Development
Kit provides a spreadsheet, called the CUDA Occupancy Calculator, where
occupancy is defined as the ratio of the number of resident warps to the maximum
number of resident warps (given in Appendix G for various compute capabilities).
Register, local, shared, and constant memory usages are reported by the compiler
when compiling with the --ptxas-options=-v option.
The total amount of shared memory required for a block is equal to the sum of the
amount of statically allocated shared memory, the amount of dynamically allocated
shared memory, and for devices of compute capability 1.x, the amount of shared
memory used to pass the kernel's arguments (see Section B.1.4).
The number of registers used by a kernel can have a significant impact on the
number of resident warps. For example, for devices of compute capability 1.2, if a
kernel uses 16 registers and each block has 512 threads and requires very little
shared memory, then two blocks (i.e. 32 warps) can reside on the multiprocessor
since they require 2x512x16 registers, which exactly matches the number of registers
available on the multiprocessor. But as soon as the kernel uses one more register,
only one block (i.e. 16 warps) can be resident since two blocks would require
2x512x17 registers, which are more registers than are available on the
multiprocessor. Therefore, the compiler attempts to minimize register usage while
keeping register spilling (see Section 5.3.2.2) and the number of instructions to a
minimum. Register usage can be controlled using the -maxrregcount compiler
option or launch bounds as described in Section B.17.
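As an illustration of the launch bounds mechanism mentioned above (a sketch with assumed values), the compiler can be told the maximum block size a kernel will be launched with, and optionally the desired minimum number of resident blocks per multiprocessor, so that it limits register usage accordingly:

    __global__ void
    __launch_bounds__(256, 2)  // at most 256 threads per block; request at
                               // least 2 resident blocks per multiprocessor
    MyKernel(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] += 1.0f;
    }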
Each double variable (on devices that support native double precision, i.e. devices
of compute capability 1.3 and higher) and each long long variable uses two
registers. However, devices of compute capability 1.2 and higher have at least twice
as many registers per multiprocessor as devices with lower compute capability.
The effect of execution configuration on performance for a given kernel call
generally depends on the kernel code. Experimentation is therefore recommended.
Applications can also parameterize execution configurations based on register file
size and shared memory size, which depends on the compute capability of the
device, as well as on the number of multiprocessors and memory bandwidth of the
device, all of which can be queried using the runtime or driver API (see reference
manual).
The number of threads per block should be chosen as a multiple of the warp size to
avoid wasting computing resources with under-populated warps as much as
possible.
An instruction that accesses addressable memory might need to be re-issued multiple
times depending on the distribution of the memory addresses across the threads within the warp. How the
distribution affects the instruction throughput this way is specific to each type of
memory and described in the following sections. For example, for global memory,
as a general rule, the more scattered the addresses are, the more reduced the
throughput is.
Automatic variables that the compiler is likely to place in local memory are:
Arrays for which it cannot determine that they are indexed with constant
quantities,
Large structures or arrays that would consume too much register space,
Any variable if the kernel uses more registers than available (this is also known
as register spilling).
Inspection of the PTX assembly code (obtained by compiling with the -ptx or
-keep option) will tell if a variable has been placed in local memory during the first
compilation phases as it will be declared using the .local mnemonic and accessed
using the ld.local and st.local mnemonics. Even if it has not, subsequent
compilation phases might still decide otherwise though if they find it consumes too
much register space for the targeted architecture: Inspection of the cubin object
using cuobjdump will tell if this is the case. Also, the compiler reports total local
memory usage per kernel (lmem) when compiling with the --ptxas-options=-v
option. Note that some mathematical functions have implementation paths that
might access local memory.
The local memory space resides in device memory, so local memory accesses have
same high latency and low bandwidth as global memory accesses and are subject to
the same requirements for memory coalescing as described in Section 5.3.2.1. Local
memory is however organized such that consecutive 32-bit words are accessed by
consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads
in a warp access the same relative address (e.g. same index in an array variable, same
member in a structure variable).
On devices of compute capability 2.x, local memory accesses are always cached in
L1 and L2 in the same way as global memory accesses (see Section G.4.2).
For devices of compute capability 1.x, a constant memory request for a warp is first
split into two requests, one for each half-warp, that are issued independently.
A request is then split into as many separate requests as there are different memory
addresses in the initial request, decreasing throughput by a factor equal to the
number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in
case of a cache hit, or at the throughput of device memory otherwise.
All throughputs are for one multiprocessor. They must be multiplied by the number
of multiprocessors in the device to get throughput for the whole device.
Table 5-1. Throughput of Native Arithmetic Instructions
(Operations per Clock Cycle per Multiprocessor)

                                                   Compute Capability
                                                 1.x         2.0         2.1
32-bit floating-point
  add, multiply, multiply-add                     8           32          48
64-bit floating-point
  add, multiply, multiply-add                     1           16           4
32-bit integer
  add, logical operation                          8           32          48
32-bit integer
  shift, compare                                  8           16          16
32-bit integer
  multiply, multiply-add, sum of               Multiple       16          16
  absolute difference                        instructions
24-bit integer multiply (__[u]mul24)              8        Multiple    Multiple
                                                         instructions instructions
32-bit floating-point
  reciprocal, reciprocal square root,
  base-2 logarithm (__log2f),                     2            4           8
  base-2 exponential (__exp2f),
  sine (__sinf), cosine (__cosf)
Type conversions                                  8           16          16
Other instructions and functions are implemented on top of the native instructions.
The implementation may be different for devices of compute capability 1.x and
devices of compute capability 2.x, and the number of native instructions after
compilation may fluctuate with every compiler version. For complicated functions,
there can be multiple code paths depending on input. cuobjdump can be used to
inspect a particular implementation in a cubin object.
The implementation of some functions is readily available in the CUDA header
files (math_functions.h, device_functions.h, …).
In general, code compiled with -ftz=true (denormalized numbers are flushed to
zero) tends to have higher performance than code compiled with -ftz=false.
Similarly, code compiled with -prec-div=false (less precise division) tends to
have higher performance than code compiled with -prec-div=true, and
code compiled with -prec-sqrt=false (less precise square root) tends to have
higher performance than code compiled with -prec-sqrt=true. The nvcc user
manual describes these compilation flags in more detail.
Single-Precision Floating-Point Addition and Multiplication Intrinsics
__fadd_r[d,u], __fmul_r[d,u], and __fmaf_r[n,z,d,u] (see
Section C.2.1) compile to tens of instructions for devices of compute capability 1.x,
but map to a single native instruction for devices of compute capability 2.x.
Single-Precision Floating-Point Division
__fdividef(x, y) (see Section C.2.1) provides faster single-precision floating-
point division than the division operator.
Single-Precision Floating-Point Reciprocal Square Root
To preserve IEEE-754 semantics the compiler can optimize 1.0/sqrtf() into
rsqrtf() only when both reciprocal and square root are approximate (i.e. with
-prec-div=false and -prec-sqrt=false). It is therefore recommended to
invoke rsqrtf() directly where desired.
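For example, a minimal sketch of calling rsqrtf() directly in a kernel (the
normalize3 kernel and its use of float3 are illustrative):
__global__ void normalize3(float3* v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float3 p = v[i];
        // rsqrtf() is the reciprocal square root intrinsic
        float inv = rsqrtf(p.x * p.x + p.y * p.y + p.z * p.z);
        v[i] = make_float3(p.x * inv, p.y * inv, p.z * inv);
    }
}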
Single-Precision Floating-Point Square Root
Single-precision floating-point square root is implemented as a reciprocal square
root followed by a reciprocal instead of a reciprocal square root followed by a
multiplication so that it gives correct results for 0 and infinity. Therefore, its
throughput is 1 operation per clock cycle for devices of compute capability 1.x and
2 operations per clock cycle for devices of compute capability 2.x.
Sine and Cosine
sinf(x), cosf(x), tanf(x), sincosf(x), and corresponding double-
precision instructions are much more expensive and even more so if the argument x
is large in magnitude.
More precisely, the argument reduction code (see math_functions.h for
implementation) comprises two code paths referred to as the fast path and the slow
path, respectively.
The fast path is used for arguments sufficiently small in magnitude and essentially
consists of a few multiply-add operations. The slow path is used for arguments large
in magnitude and consists of lengthy computations required to achieve correct
results over the entire argument range.
At present, the argument reduction code for the trigonometric functions selects the
fast path for arguments whose magnitude is less than 48039.0f for the single-
precision functions, and less than 2147483648.0 for the double-precision functions.
As the slow path requires more registers than the fast path, an attempt has been
made to reduce register pressure in the slow path by storing some intermediate
variables in local memory, which may affect performance because of local memory's
high latency and low bandwidth (see Section 5.3.2.2). At present, 28 bytes of local
memory are used by single-precision functions, and 44 bytes are used by double-
precision functions. However, the exact amount is subject to change.
Due to the lengthy computations and use of local memory in the slow path, the
throughput of these trigonometric functions is lower by one order of magnitude
when the slow path reduction is required as opposed to the fast path reduction.
Integer Arithmetic
On devices of compute capability 1.x, 32-bit integer multiplication is implemented
using multiple instructions as it is not natively supported. 24-bit integer
multiplication is natively supported however via the __[u]mul24 intrinsic (see
Section C.2.3). Using __[u]mul24 instead of the 32-bit multiplication operator
whenever possible usually improves performance for instruction bound kernels. It
can have the opposite effect however in cases where the use of __[u]mul24
inhibits compiler optimizations.
On devices of compute capability 2.x, 32-bit integer multiplication is natively
supported, but 24-bit integer multiplication is not. __[u]mul24 is therefore
implemented using multiple instructions and should not be used.
Integer division and modulo operation are costly: tens of instructions on devices of
compute capability 1.x, below 20 instructions on devices of compute capability 2.x.
They can be replaced with bitwise operations in some cases: If n is a power of 2,
(i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1));
the compiler will perform these conversions if n is literal.
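As an illustrative sketch (the helper divmodPow2 is hypothetical; it assumes n is a
power of two and i is non-negative):
__device__ void divmodPow2(int i, int n, int* quotient, int* remainder)
{
    // __ffs() returns the 1-based position of the least significant set
    // bit, so for a power of two n, __ffs(n) - 1 equals log2(n)
    int log2n = __ffs(n) - 1;
    *quotient  = i >> log2n;   // same result as i / n
    *remainder = i & (n - 1);  // same result as i % n
}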
__brev, __brevll, __popc, and __popcll (see Section C.2.3) compile to tens
of instructions for devices of compute capability 1.x, but __brev and __popc map
to a single instruction for devices of compute capability 2.x and __brevll and
__popcll to just a few.
__clz, __clzll, __ffs, and __ffsll (see Section C.2.3) compile to fewer
instructions for devices of compute capability 2.x than for devices of compute
capability 1.x.
Type Conversion
Sometimes, the compiler must insert conversion instructions, introducing additional
execution cycles. This is the case for:
Functions operating on variables of type char or short whose operands
generally need to be converted to int,
Double-precision floating-point constants (i.e. those constants defined without
any type suffix) used as input to single-precision floating-point computations (as
mandated by C/C++ standards).
This last case can be avoided by using single-precision floating-point constants,
defined with an f suffix such as 3.141592653589793f, 1.0f, 0.5f.
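For example, a minimal sketch contrasting the two forms (the helper functions are
illustrative):
__device__ float scaleSlow(float x) { return x * 0.5;  } // 0.5 is double precision:
                                                         // x is promoted to double
__device__ float scaleFast(float x) { return x * 0.5f; } // stays in single precision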
To obtain best performance in cases where the control flow depends on the thread
ID, the controlling condition should be written so as to minimize the number of
divergent warps. This is possible because the distribution of the warps across the
block is deterministic as mentioned in Section 4.1. A trivial example is when the
controlling condition only depends on (threadIdx / warpSize) where
warpSize is the warp size. In this case, no warp diverges since the controlling
condition is perfectly aligned with the warps.
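A minimal sketch of such a warp-aligned condition (the kernel is illustrative):
__global__ void warpAligned(float* data)
{
    // the condition is uniform within each warp, so no warp diverges
    if ((threadIdx.x / warpSize) == 0)
        data[threadIdx.x] *= 2.0f;  // taken by every thread of warp 0
    else
        data[threadIdx.x] += 1.0f;  // taken by every thread of the other warps
}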
Sometimes, the compiler may unroll loops or it may optimize out if or switch
statements by using branch predication instead, as detailed below. In these cases, no
warp can ever diverge. The programmer can also control loop unrolling using the
#pragma unroll directive (see Section E.2).
When using branch predication none of the instructions whose execution depends
on the controlling condition gets skipped. Instead, each of them is associated with a
per-thread condition code or predicate that is set to true or false based on the
controlling condition and although each of these instructions gets scheduled for
execution, only the instructions with a true predicate are actually executed.
Instructions with a false predicate do not write results, and also do not evaluate
addresses or read operands.
The compiler replaces a branch instruction with predicated instructions only if the
number of instructions controlled by the branch condition is less or equal to a
certain threshold: If the compiler determines that the condition is likely to produce
many divergent warps, this threshold is 7, otherwise it is 4.
Table A-1 lists all CUDA-enabled devices with their compute capability, number of
multiprocessors, and number of CUDA cores.
These, as well as the clock frequency and the total amount of device memory, can
be queried using the runtime or driver API (see reference manual).
B.1.1 __device__
The __device__ qualifier declares a function that is:
Executed on the device
Callable from the device only.
In device code compiled for devices of compute capability 1.x, a __device__
function is always inlined by default. The __noinline__ function qualifier
however can be used as a hint for the compiler not to inline the function if possible
(see Section E.1).
B.1.2 __global__
The __global__ qualifier declares a function as being a kernel. Such a function is:
Executed on the device,
Callable from the host only.
__global__ functions must have void return type.
Any call to a __global__ function must specify its execution configuration as
described in Section B.16.
A call to a __global__ function is asynchronous, meaning it returns before the
device has completed its execution.
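For example, a minimal sketch of a kernel launch followed by an explicit
synchronization (the kernel and host function names are illustrative):
__global__ void VecScale(float* data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= s;
}
// Host code
void scaleOnDevice(float* devData, int n)
{
    // the launch returns immediately, before the device has finished
    VecScale<<<(n + 255) / 256, 256>>>(devData, 2.0f, n);
    // block the host until the kernel has completed
    cudaThreadSynchronize();
}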
B.1.3 __host__
The __host__ qualifier declares a function that is:
B.1.4 Restrictions
B.1.4.1 Function Parameters
__global__ function parameters are passed to the device:
via shared memory and are limited to 256 bytes on devices of compute
capability 1.x,
via constant memory and are limited to 4 KB on devices of compute
capability 2.x.
B.1.4.5 Recursion
__global__ functions do not support recursion.
__device__ functions only support recursion in device code compiled for devices
of compute capability 2.x.
B.2.1 __device__
The __device__ qualifier declares a variable that resides on the device.
At most one of the other type qualifiers defined in the next three sections may be
used together with __device__ to further specify which memory space the
variable belongs to. If none of them is present, the variable:
Resides in global memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host through the
runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() /
cudaMemcpyToSymbol() / cudaMemcpyFromSymbol() for the runtime
API and cuModuleGetGlobal() for the driver API).
B.2.2 __constant__
The __constant__ qualifier, optionally used together with __device__,
declares a variable that:
Resides in constant memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host through the
runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() /
cudaMemcpyToSymbol() / cudaMemcpyFromSymbol() for the runtime
API and cuModuleGetGlobal() for the driver API).
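For example, a minimal sketch of declaring a __constant__ variable and
initializing it from the host (the variable and function names are illustrative):
__constant__ float coefficients[16];
__global__ void applyFilter(float* data)
{
    // every thread reads the array from constant memory
    data[threadIdx.x] *= coefficients[threadIdx.x % 16];
}
// Host code
void setCoefficients(const float* hostCoefficients)
{
    // copy 16 floats from host memory to the __constant__ variable
    cudaMemcpyToSymbol(coefficients, hostCoefficients, 16 * sizeof(float));
}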
B.2.3 __shared__
The __shared__ qualifier, optionally used together with __device__, declares a
variable that:
Resides in the shared memory space of a thread block,
Has the lifetime of the block,
Is only accessible from all the threads within the block.
When declaring a variable in shared memory as an external array such as
extern __shared__ float shared[];
the size of the array is determined at launch time (see Section B.16). All variables
declared in this fashion start at the same address in memory, so the layout of
the variables in the array must be explicitly managed through offsets. For example, if
one wants the equivalent of
short array0[128];
float array1[64];
int array2[256];
in dynamically allocated shared memory, one could declare and initialize the arrays
the following way:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
Note that pointers need to be aligned to the type they point to, so the following
code, for example, does not work since array1 is not aligned to 4 bytes.
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[127];
}
Alignment requirements for the built-in vector types are listed in Table B-1.
B.2.4 Restrictions
The __device__, __shared__ and __constant__ qualifiers are not allowed
on struct and union members, on formal parameters and on local variables
within a function that executes on the host.
B.2.4.2 Assignment
__constant__ variables cannot be assigned to from the device, only from the
host through host runtime functions (Sections 3.2.1 and 3.3.4).
__shared__ variables cannot have an initialization as part of their declaration.
B.2.4.4 Pointers
For devices of compute capability 1.x, pointers in code that is executed on the
device are supported as long as the compiler is able to resolve whether they point to
either the shared memory space or the global memory space, otherwise they are
restricted to only point to memory allocated or declared in the global memory space.
For devices of compute capability 2.x, pointers are supported without any
restriction.
Dereferencing a pointer either to global or shared memory in code that is executed
on the host or to host memory in code that is executed on the device results in an
undefined behavior, most often in a segmentation fault and application termination.
The address obtained by taking the address of a __device__, __shared__ or
__constant__ variable can only be used in device code. The address of a
__device__ or __constant__ variable obtained through
cudaGetSymbolAddress() as described in Section 3.3.4 can only be used in
host code.
B.2.5 volatile
Only after the execution of a __threadfence_block(), __threadfence(),
or __syncthreads() (Sections B.5 and B.6) are prior writes to global or shared
memory guaranteed to be visible by other threads. As long as this requirement is
met, the compiler is free to optimize reads and writes to global or shared memory.
For example, in the code sample below, the first reference to myArray[tid]
compiles into a global or shared memory read instruction, but the second reference
does not as the compiler simply reuses the result of the first read.
// myArray is an array of non-zero integers
// located in global or shared memory
__global__ void MyKernel(int* result) {
int tid = threadIdx.x;
int ref1 = myArray[tid] * 1;
myArray[tid + 1] = 2;
int ref2 = myArray[tid] * 1;
result[tid] = ref1 * ref2;
}
Therefore, ref2 cannot possibly be equal to 2 in thread tid as a result of thread
tid-1 overwriting myArray[tid] by 2.
This behavior can be changed using the volatile keyword: If a variable located in
global or shared memory is declared as volatile, the compiler assumes that its value
can be changed at any time by another thread and therefore any reference to this
variable compiles to an actual memory read instruction.
Note that even if myArray is declared as volatile in the code sample above, there is
no guarantee, in general, that ref2 will be equal to 2 in thread tid since thread
tid might read myArray[tid] into ref2 before thread tid-1 overwrites its
value by 2. Synchronization is required as mentioned in Section 5.4.3.
Type                        Alignment
longlong1, ulonglong1       8
longlong2, ulonglong2       16
float1                      4
float2                      8
float3                      4
float4                      16
double1                     8
double2                     16
B.3.2 dim3
This type is an integer vector type based on uint3 that is used to specify
dimensions. When defining a variable of type dim3, any component left unspecified
is initialized to 1.
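For example (a minimal sketch; MyKernel is illustrative):
// Host code
dim3 threadsPerBlock(16, 16); // z defaults to 1, i.e. (16, 16, 1)
dim3 numBlocks(64);           // y and z default to 1, i.e. (64, 1, 1)
// MyKernel<<<numBlocks, threadsPerBlock>>>(...);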
B.4.1 gridDim
This variable is of type dim3 (see Section B.3.2) and contains the dimensions of the
grid.
B.4.2 blockIdx
This variable is of type uint3 (see Section B.3.1) and contains the block index
within the grid.
B.4.3 blockDim
This variable is of type dim3 (see Section B.3.2) and contains the dimensions of the
block.
B.4.4 threadIdx
This variable is of type uint3 (see Section B.3.1) and contains the thread index
within the block.
B.4.5 warpSize
This variable is of type int and contains the warp size in threads (see Section 4.1
for the definition of a warp).
B.4.6 Restrictions
It is not allowed to take the address of any of the built-in variables.
It is not allowed to assign values to any of the built-in variables.
Without a memory fence between storing a block's partial sum and incrementing the
counter, the counter might reach gridDim.x - 1 and let the threads of the last block
start reading partial sums before they have been actually updated in memory.
__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N, float* result)
{
    // Each block sums a subset of the input array (calculatePartialSum()
    // and calculateTotalSum() are user-provided helper functions)
    float partialSum = calculatePartialSum(array, N);
    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;  // store this block's partial sum
        __threadfence();                  // make it visible to all other blocks
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == (gridDim.x - 1));
    }
    __syncthreads();  // every thread now reads the correct isLastBlockDone
    if (isLastBlockDone) {
        float totalSum = calculateTotalSum(result);  // sum the partial sums
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;  // reset for the next kernel launch
        }
    }
}
void __syncthreads();
waits until all threads in the thread block have reached this point and all global and
shared memory accesses made by these threads prior to __syncthreads() are
visible to all threads in the block.
__syncthreads() is used to coordinate communication between the threads of
the same block. When some threads within a block access the same addresses in
shared or global memory, there are potential read-after-write, write-after-read, or
write-after-write hazards for some of these memory accesses. These data hazards
can be avoided by synchronizing threads in-between these accesses.
__syncthreads() is allowed in conditional code but only if the conditional
evaluates identically across the entire thread block, otherwise the code execution is
likely to hang or produce unintended side effects.
Devices of compute capability 2.x support three variations of __syncthreads()
described below.
int __syncthreads_count(int predicate);
is identical to __syncthreads() with the additional feature that it evaluates
predicate for all threads of the block and returns the number of threads for
which predicate evaluates to non-zero.
int __syncthreads_and(int predicate);
is identical to __syncthreads() with the additional feature that it evaluates
predicate for all threads of the block and returns non-zero if and only if
predicate evaluates to non-zero for all of them.
int __syncthreads_or(int predicate);
is identical to __syncthreads() with the additional feature that it evaluates
predicate for all threads of the block and returns non-zero if and only if
predicate evaluates to non-zero for any of them.
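For example, a minimal sketch using __syncthreads_count() (the kernel is
illustrative and assumes every thread indexes a valid element of data):
__global__ void countAbove(const float* data, float threshold,
                           int* blockCounts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // every thread of the block contributes its predicate and
    // receives the same per-block count
    int count = __syncthreads_count(data[i] > threshold);
    if (threadIdx.x == 0)
        blockCounts[blockIdx.x] = count;
}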
B.8.1 tex1Dfetch()
template<class Type>
Type tex1Dfetch(
texture<Type, 1, cudaReadModeElementType> texRef,
int x);
float tex1Dfetch(
texture<unsigned char, 1, cudaReadModeNormalizedFloat> texRef,
int x);
float tex1Dfetch(
texture<signed char, 1, cudaReadModeNormalizedFloat> texRef,
int x);
float tex1Dfetch(
texture<unsigned short, 1, cudaReadModeNormalizedFloat> texRef,
int x);
float tex1Dfetch(
texture<signed short, 1, cudaReadModeNormalizedFloat> texRef,
int x);
fetch the region of linear memory bound to texture reference texRef using integer
texture coordinate x. No texture filtering or addressing modes are supported. For
integer types, these functions may optionally promote the integer to single-precision
floating point.
Besides the functions shown above, 2-, and 4-tuples are supported; for example:
float4 tex1Dfetch(
texture<uchar4, 1, cudaReadModeNormalizedFloat> texRef,
int x);
fetches the region of linear memory bound to texture reference texRef using
texture coordinate x.
B.8.2 tex1D()
template<class Type, enum cudaTextureReadMode readMode>
Type tex1D(texture<Type, 1, readMode> texRef,
float x);
fetches the CUDA array bound to texture reference texRef using texture
coordinate x.
B.8.3 tex2D()
template<class Type, enum cudaTextureReadMode readMode>
Type tex2D(texture<Type, 2, readMode> texRef,
float x, float y);
fetches the CUDA array or the region of linear memory bound to texture reference
texRef using texture coordinates x and y.
B.8.4 tex3D()
template<class Type, enum cudaTextureReadMode readMode>
Type tex3D(texture<Type, 3, readMode> texRef,
float x, float y, float z);
fetches the CUDA array bound to texture reference texRef using texture
coordinates x, y, and z.
B.9.1 surf1Dread()
template<class Type>
Type surf1Dread(surface<void, 1> surfRef, int x,
boundaryMode = cudaBoundaryModeTrap);
reads the CUDA array bound to surface reference surfRef using coordinate x.
B.9.2 surf1Dwrite()
template<class Type>
void surf1Dwrite(Type data, surface<void, 1> surfRef, int x,
boundaryMode = cudaBoundaryModeTrap);
writes value data to the CUDA array bound to surface reference surfRef at
coordinate x.
B.9.3 surf2Dread()
template<class Type>
Type surf2Dread(surface<void, 2> surfRef,
int x, int y,
boundaryMode = cudaBoundaryModeTrap);
reads the CUDA array bound to surface reference surfRef using coordinates x
and y.
B.9.4 surf2Dwrite()
template<class Type>
void surf2Dwrite(Type data, surface<void, 2> surfRef,
int x, int y,
boundaryMode = cudaBoundaryModeTrap);
writes value data to the CUDA array bound to surface reference surfRef at
coordinates x and y.
the same address. These three operations are performed in one atomic transaction.
The function returns old.
The floating-point version of atomicAdd() is only supported by devices of
compute capability 2.x.
B.11.1.2 atomicSub()
int atomicSub(int* address, int val);
unsigned int atomicSub(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes (old - val), and stores the result back to memory at the
same address. These three operations are performed in one atomic transaction. The
function returns old.
B.11.1.3 atomicExch()
int atomicExch(int* address, int val);
unsigned int atomicExch(unsigned int* address,
unsigned int val);
unsigned long long int atomicExch(unsigned long long int* address,
unsigned long long int val);
float atomicExch(float* address, float val);
reads the 32-bit or 64-bit word old located at the address address in global or
shared memory and stores val back to memory at the same address. These two
operations are performed in one atomic transaction. The function returns old.
B.11.1.4 atomicMin()
int atomicMin(int* address, int val);
unsigned int atomicMin(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes the minimum of old and val, and stores the result back to
memory at the same address. These three operations are performed in one atomic
transaction. The function returns old.
B.11.1.5 atomicMax()
int atomicMax(int* address, int val);
unsigned int atomicMax(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes the maximum of old and val, and stores the result back to
memory at the same address. These three operations are performed in one atomic
transaction. The function returns old.
B.11.1.6 atomicInc()
unsigned int atomicInc(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes ((old >= val) ? 0 : (old+1)), and stores the result
back to memory at the same address. These three operations are performed in one
atomic transaction. The function returns old.
B.11.1.7 atomicDec()
unsigned int atomicDec(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes (((old == 0) | (old > val)) ? val : (old-1)),
and stores the result back to memory at the same address. These three operations
are performed in one atomic transaction. The function returns old.
B.11.1.8 atomicCAS()
int atomicCAS(int* address, int compare, int val);
unsigned int atomicCAS(unsigned int* address,
unsigned int compare,
unsigned int val);
unsigned long long int atomicCAS(unsigned long long int* address,
unsigned long long int compare,
unsigned long long int val);
reads the 32-bit or 64-bit word old located at the address address in global or
shared memory, computes (old == compare ? val : old), and stores the
result back to memory at the same address. These three operations are performed in
one atomic transaction. The function returns old (Compare And Swap).
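For illustration, a minimal sketch of how atomicCAS() can serve as a building
block for other atomic operations, here a double-precision add (the helper name
atomicAddDouble is illustrative; 64-bit atomicCAS() on global memory requires
compute capability 1.2 or higher, and the type-casting intrinsics
__double_as_longlong() and __longlong_as_double() are used):
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull =
        (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // atomicCAS() returns the value it found, so the loop retries
        // until no other thread has modified the location in between
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}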
B.11.2.2 atomicOr()
int atomicOr(int* address, int val);
unsigned int atomicOr(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes (old | val), and stores the result back to memory at the
same address. These three operations are performed in one atomic transaction. The
function returns old.
B.11.2.3 atomicXor()
int atomicXor(int* address, int val);
unsigned int atomicXor(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or shared
memory, computes (old ^ val), and stores the result back to memory at the
same address. These three operations are performed in one atomic transaction. The
function returns old.
B.14.2 Limitations
Final formatting of the printf() output takes place on the host system. This
means that the format string must be understood by the host system's compiler and
C library. Every effort has been made to ensure that the format specifiers supported
by CUDA's printf() form a universal subset of those supported by the most common
host compilers, but exact behavior will be host-O/S-dependent.
As described in Section B.14.1, printf() will accept all combinations of valid flags
and types. This is because it cannot determine what will and will not be valid on the
host system where the final output is formatted. The effect of this is that output
may be undefined if the program emits a format string which contains invalid
combinations.
The output buffer for printf() is set to a fixed size before kernel launch (see
below). This buffer is circular and is flushed at any host-side synchronization point
and when the context is explicitly destroyed; if more output is produced during
kernel execution than can fit in the buffer, older output is overwritten.
The printf() command can accept at most 32 arguments in addition to the
format string. Additional arguments beyond this will be ignored, and the format
specifier output as-is.
Owing to the differing size of the long type (four bytes on 64-bit Windows
platforms, eight bytes on other 64-bit platforms), a kernel which is compiled on a
non-Windows 64-bit machine but then run on a win64 machine will see corrupted
output for all format strings which include "%ld". It is recommended that the
compilation platform match the execution platform to ensure safety.
The output buffer for printf() is not flushed automatically to the output stream,
but instead is flushed only when one of these actions is performed:
Kernel launch via <<<>>> or cuLaunch(),
Synchronization via cudaThreadSynchronize(),
cuCtxSynchronize(), cudaStreamSynchronize(), or
cuStreamSynchronize(),
Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
Context destruction via cudaThreadExit() or cuCtxDestroy().
Note that the buffer is not flushed automatically when the program exits. The user
must call cudaThreadExit() or cuCtxDestroy() explicitly, as shown in the
examples below.
B.14.4 Examples
The following code sample:
__global__ void helloCUDA(float f)
{
printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}
void main()
{
helloCUDA<<<1, 5>>>(1.2345f);
cudaThreadExit();
}
will output:
Hello thread 0, f=1.2345
Hello thread 1, f=1.2345
Hello thread 2, f=1.2345
Hello thread 3, f=1.2345
Hello thread 4, f=1.2345
Notice how each thread encounters the printf() command, so there are as many
lines of output as there were threads launched in the grid. As expected, global values
(i.e. float f) are common between all threads, and local values (i.e.
threadIdx.x) are distinct per-thread.
The following code sample:
__global__ void helloCUDA(float f)
{
if (threadIdx.x == 0)
printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}
void main()
{
helloCUDA<<<1, 5>>>(1.2345f);
cudaThreadExit();
}
will output:
Hello thread 0, f=1.2345
Self-evidently, the if() statement limits which threads will call printf, so that
only a single line of output is seen.
B.15.3 Examples
B.15.3.1 Per Thread Allocation
The following code sample:
__global__ void mallocTest()
{
char* ptr = (char*)malloc(123);
printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
free(ptr);
}
void main()
{
// Set a heap size of 128 megabytes. Note that this must
// be done before any kernel is launched.
cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
mallocTest<<<1, 5>>>();
cudaThreadSynchronize();
}
will output:
Thread 0 got pointer: 00057020
Thread 1 got pointer: 0005708c
Thread 2 got pointer: 000570f8
Thread 3 got pointer: 00057164
Thread 4 got pointer: 000571d0
Notice how each thread encounters the malloc() command and so receives its
own allocation. (Exact pointer values will vary: these are illustrative.)
void main()
{
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<10, 128>>>();
    cudaThreadSynchronize();
}
void main()
{
cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
// Allocate memory
allocmem<<< NUM_BLOCKS, 10 >>>();
// Use memory
usemem<<< NUM_BLOCKS, 10 >>>();
usemem<<< NUM_BLOCKS, 10 >>>();
usemem<<< NUM_BLOCKS, 10 >>>();
// Free memory
freemem<<< NUM_BLOCKS, 10 >>>();
cudaThreadSynchronize();
}
Optimal launch bounds for a given kernel will usually differ across major
architecture revisions. The sample code below shows how this is typically handled in
device code using the __CUDA_ARCH__ macro introduced in Section 3.1.4.
#define THREADS_PER_BLOCK 256
#if __CUDA_ARCH__ >= 200
#define MY_KERNEL_MAX_THREADS (2 * THREADS_PER_BLOCK)
#define MY_KERNEL_MIN_BLOCKS 3
#else
#define MY_KERNEL_MAX_THREADS THREADS_PER_BLOCK
#define MY_KERNEL_MIN_BLOCKS 2
#endif
// Device code
__global__ void
__launch_bounds__(MY_KERNEL_MAX_THREADS, MY_KERNEL_MIN_BLOCKS)
MyKernel(...)
{
...
}
In the common case where MyKernel is invoked with the maximum number of
threads per block (specified as the first parameter of __launch_bounds__()), it
is tempting to use MY_KERNEL_MAX_THREADS as the number of threads per block
in the execution configuration:
// Host code
MyKernel<<<blocksPerGrid, MY_KERNEL_MAX_THREADS>>>(...);
This will not work, however, since __CUDA_ARCH__ is undefined in host code as
mentioned in Section 3.1.4, so MyKernel will launch with 256 threads per block
even when __CUDA_ARCH__ is greater than or equal to 200. Instead the number of
threads per block should be determined:
Either at compile time using a macro that does not depend on
__CUDA_ARCH__, for example
// Host code
MyKernel<<<blocksPerGrid, THREADS_PER_BLOCK>>>(...);
Or at runtime based on the compute capability
// Host code
cudaGetDeviceProperties(&deviceProp, device);
int threadsPerBlock =
(deviceProp.major >= 2 ?
2 * THREADS_PER_BLOCK : THREADS_PER_BLOCK);
MyKernel<<<blocksPerGrid, threadsPerBlock>>>(...);
Register usage is reported by the --ptxas-options=-v compiler option. The
number of resident blocks can be derived from the occupancy reported by the
CUDA profiler (see Section 5.2.3 for a definition of occupancy).
Register usage can also be controlled for all __global__ functions in a file using
the -maxrregcount compiler option. The value of -maxrregcount is ignored
for functions with launch bounds.
Functions from Section C.1 can be used in both host and device code whereas
functions from Section C.2 can only be used in device code.
Note that floating-point functions are overloaded, so that in general, there are three
prototypes for a given function <func-name>:
(1) double <func-name>(double), e.g. double log(double)
(2) float <func-name>(float), e.g. float log(float)
(3) float <func-name>f(float), e.g. float logf(float)
This means, in particular, that passing a float argument always results in a float
result (variants (2) and (3) above).
__popc(x) returns the number of bits that are set to 1 in the binary representation
of 32-bit integer parameter x.
__popcll(x) returns the number of bits that are set to 1 in the binary
representation of 64-bit integer parameter x.
__brev(x) reverses the bits of 32-bit unsigned integer parameter x, i.e. bit N of
the result corresponds to bit 31-N of x.
__brevll(x) reverses the bits of 64-bit unsigned long long parameter x, i.e. bit N
of the result corresponds to bit 63-N of x.
__byte_perm(x,y,s) returns, as a 32-bit integer r, four bytes from eight input
bytes provided in the two input integers x and y. The input bytes are indexed as
follows:
input[0] = x<0:7> input[1] = x<8:15>
input[2] = x<16:23> input[3] = x<24:31>
input[4] = y<0:7> input[5] = y<8:15>
input[6] = y<16:23> input[7] = y<24:31>
The selector indices are stored in 4-bit nibbles (with the upper 16 bits of the selector
not being used):
selector[0] = s<0:3> selector[1] = s<4:7>
selector[2] = s<8:11> selector[3] = s<12:15>
The returned value r is computed to be:
result[n] := input[selector[n]]
where result[n] is the nth byte of r.
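For example, a minimal sketch of a few selector values (the function is
illustrative):
__device__ unsigned int bytePermExamples(unsigned int x, unsigned int y)
{
    unsigned int r0 = __byte_perm(x, y, 0x3210); // selects bytes 0..3 of x: returns x
    unsigned int r1 = __byte_perm(x, y, 0x0123); // reverses the byte order of x
    unsigned int r2 = __byte_perm(x, y, 0x5410); // low 16 bits of x in the low half,
                                                 // low 16 bits of y in the high half
    return r0 ^ r1 ^ r2;
}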
__ll2float_[rn,rz,ru,rd](x)
__ull2float_[rn,rz,ru,rd](x)
__float2half_rn(x)
__half2float(x)
__double2float_[rn,rz,ru,rd](x)
__double2int_[rn,rz,ru,rd](x)
__double2uint_[rn,rz,ru,rd](x)
__double2ll_[rn,rz,ru,rd](x)
__double2ull_[rn,rz,ru,rd](x)
__int2double_rn(x)
__uint2double_rn(x)
__ll2double_[rn,rz,ru,rd](x)
__ull2double_[rn,rz,ru,rd](x)
CUDA supports the following C++ language constructs for device code:
Polymorphism
Default Parameters
Operator Overloading
Namespaces
Function Templates
Classes for devices of compute capability 2.x
These C++ constructs are implemented as specified in "The C++ Programming
Language" reference. It is valid to use any of these constructs in .cu CUDA files for
host, device, and kernel (__global__) functions. Any restrictions detailed in previous
parts of this programming guide, like the lack of support for recursion, still apply.
The following subsections provide examples of the various constructs.
D.1 Polymorphism
Generally, polymorphism is the ability to define that functions or operators behave
differently in different contexts. This is also referred to as function (and operator,
see below) overloading.
In practical terms, this means that it is permissible to define two different functions
within the same scope (namespace) as long as they have a distinguishable function
signature. That means that the two functions either consume a different number of
parameters or parameters of different types. When one of these functions is invoked,
the compiler resolves the call to the function's implementation that matches
the function signature.
Because of implicit typecasting, a compiler may encounter multiple potential
matches for a function invocation and in that case the matching rules as described in
the C++ Language Standard apply. In practice this means that the compiler will pick
the closest match in case of multiple potential matches.
Example: The following is valid CUDA code:
__device__ void f(float x)
{
    // do something with x
}
__device__ void f(int i)
{
    // do something with i
}
D.2 Default Parameters
Default parameters can only be given for the last n parameters of a function.
D.3 Operator Overloading
Operator overloading allows operators such as + to be defined for user-defined
types (including the built-in vector types, which are ordinary structs), so that an
expression like
c = a + b;
resolves to the overloaded operator, as sketched below.
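A minimal sketch of such an overload, using the built-in float2 vector type (the
choice of float2 is illustrative):
__device__ float2 operator+(const float2& a, const float2& b)
{
    // component-wise addition of the two operands
    float2 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    return r;
}
With this definition in scope, for float2 variables a, b, and c, the statement
c = a + b invokes the overloaded operator.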
D.4 Namespaces
Namespaces in C++ allow for the creation of a hierarchy of scopes of visibility. All
the symbols inside a namespace can be used within this namespace without
additional syntax.
Namespaces can be used to solve the problem of name clashes (two
different symbols using identical names), which commonly occur when using
multiple function libraries from different sources.
Example: The following code defines two functions "f()" in two separate
namespaces ("nvidia" and "other"):
namespace nvidia {
__device__ void f(float x)
{ /* do something with x */ ;}
}
namespace other {
__device__ void f(float x)
{ /* do something with x */ ;}
}
The functions can now be used anywhere via fully qualified names:
nvidia::f(0.5f);
All the symbols in a namespace can be imported into another namespace (scope)
like this:
using namespace nvidia;
f(0.5f);
template <>
__device__ bool
f<int>(int x)
{ return true; }
In this case the implementation for T representing the int type is specialized to
return true; all other types will be caught by the more general template and return
false.
The complete set of matching rules (for implicitly deducing template parameters)
and matching polymorphous functions apply as specified in the C++ standard.
D.6 Classes
Code compiled for devices with compute capability 2.x and higher may make use of
C++ classes, as long as none of the member functions are virtual (this restriction
will be removed in some future release).
There are two common use cases for classes without virtual member functions:
Small-data aggregations. E.g. data types like pixels (r, g, b, a), 2D and 3D points,
vectors, etc.
Functor classes. The use of functors is necessitated by the fact that device-
function pointers are not supported and thus it is not possible to pass functions
as template parameters. A workaround for this restriction is the use of functor
classes (see code sample below).
class PixelRGBA {
public:
    __device__ PixelRGBA(): r_(0), g_(0), b_(0), a_(0) { ; }
    __device__
    PixelRGBA(unsigned char r, unsigned char g, unsigned char b,
              unsigned char a = 255): r_(r), g_(g), b_(b), a_(a)
    { ; }
private:
    unsigned char r_, g_, b_, a_;
    friend PixelRGBA operator+(const PixelRGBA &, const PixelRGBA &);
};
__device__
PixelRGBA operator+(const PixelRGBA & p1, const PixelRGBA & p2)
{
    return PixelRGBA(p1.r_ + p2.r_,
                     p1.g_ + p2.g_,
                     p1.b_ + p2.b_,
                     p1.a_ + p2.a_);
}
Other device code can now make use of this new data type as one would expect:
PixelRGBA p1, p2;
PixelRGBA p3 = p1 + p2;
class Sub
{
public:
    __device__
    float operator() (float a, float b) const
    {
        return a - b;
    }
};
E.3 __restrict__
nvcc supports restricted pointers via the __restrict__ keyword.
Restricted pointers were introduced in C99 to alleviate the aliasing problem that
exists in C-type languages, and which inhibits all kinds of optimization from code re-
ordering to common sub-expression elimination.
Here is an example subject to the aliasing issue, where use of restricted pointers can
help the compiler to reduce the number of instructions:
void foo(const float* a,
const float* b,
float* c)
{
c[0] = a[0] * b[0];
c[1] = a[0] * b[0];
c[2] = a[0] * b[0] * a[1];
c[3] = a[0] * a[1];
c[4] = a[0] * b[0];
c[5] = b[0];
...
}
In C-type languages, the pointers a, b, and c may be aliased, so any write through c
could modify elements of a or b. This means that to guarantee functional
correctness, the compiler cannot load a[0] and b[0] into registers, multiply them,
and store the result to both c[0] and c[1], because the results would differ from
the abstract execution model if, say, a[0] is really the same location as c[0]. So
the compiler cannot take advantage of the common sub-expression. Likewise,
the compiler cannot just reorder the computation of c[4] into the proximity of the
computation of c[0] and c[1] because the preceding write to c[3] could change
the inputs to the computation of c[4].
By making a, b, and c restricted pointers, the programmer asserts to the compiler
that the pointers are in fact not aliased, which in this case means writes through c
would never overwrite elements of a or b. This changes the function prototype as
follows:
void foo(const float* __restrict__ a,
const float* __restrict__ b,
float* __restrict__ c);
Note that all pointer arguments need to be made restricted for the compiler
optimizer to derive any benefit. With the __restrict__ keywords added, the
compiler can now reorder and do common sub-expression elimination at will, while
retaining functionality identical with the abstract execution model:
void foo(const float* __restrict__ a,
const float* __restrict__ b,
float* __restrict__ c)
{
float t0 = a[0];
float t1 = b[0];
float t2 = t0 * t1;
float t3 = a[1];
c[0] = t2;
c[1] = t2;
c[4] = t2;
c[2] = t2 * t3;
c[3] = t0 * t3;
c[5] = t1;
...
}
The effects here are a reduced number of memory accesses and reduced number of
computations. This is balanced by an increase in register pressure due to "cached"
loads and common sub-expressions.
Since register pressure is a critical issue in many CUDA codes, use of restricted
pointers can have negative performance impact on CUDA code, due to reduced
occupancy.
This appendix gives the formula used to compute the value returned by the texture
functions of Section B.8 depending on the various attributes of the texture reference
(see Section 3.2.4).
The texture bound to the texture reference is represented as an array T of N texels
for a one-dimensional texture, N × M texels for a two-dimensional texture, or
N × M × L texels for a three-dimensional texture. It is fetched using texture
coordinates x, y, and z.
A texture coordinate must fall within T's valid addressing range before it can be
used to address T. The addressing mode specifies how an out-of-range texture
coordinate x is remapped to the valid range. If x is non-normalized, only the clamp
addressing mode is supported and x is replaced by 0 if x < 0 and by N - 1 if N <= x. If
x is normalized, x is replaced by 0 if x < 0 and by 1 - 1/N if 1 <= x in clamp
addressing mode, and by frac(x) = x - floor(x) in wrap addressing mode.
[Figures: plots of tex(x) for a one-dimensional texture of four texels T[0] to T[3],
addressed with non-normalized coordinates x in [0, 4], illustrating nearest-point
sampling and linear filtering, together with a plot of the table lookup TL(x)
(normalized coordinates in [0, 1]).]
The general specifications and features of a compute device depend on its compute
capability (see Section 2.5).
Section G.1 gives the features and technical specifications associated to each
compute capability.
Section G.2 reviews the compliance with the IEEE floating-point standard.
Sections G.3 and G.4 give more details on the architecture of devices of compute
capability 1.x and 2.x, respectively.
G.3.1 Architecture
For devices of compute capability 1.x, a multiprocessor consists of:
8 CUDA cores for integer and single-precision floating-point arithmetic
operations,
1 double-precision floating-point unit for double-precision floating-point
arithmetic operations,
2 special function units for single-precision floating-point transcendental
functions (these units can also handle single-precision floating-point
multiplications),
1 warp scheduler.
To execute an instruction for all threads of a warp, the warp scheduler must
therefore issue the instruction over:
4 clock cycles for an integer or single-precision floating-point arithmetic
instruction,
32 clock cycles for a double-precision floating-point arithmetic instruction,
16 clock cycles for a single-precision floating-point transcendental instruction.
A multiprocessor also has a read-only constant cache that is shared by all functional
units and speeds up reads from the constant memory space, which resides in device
memory.
Multiprocessors are grouped into Texture Processor Clusters (TPCs). The number of
multiprocessors per TPC is:
2 for devices of compute capabilities 1.0 and 1.1,
3 for devices of compute capabilities 1.2 and 1.3.
Each TPC has a read-only texture cache that is shared by all multiprocessors and
speeds up reads from the texture memory space, which resides in device memory.
Each multiprocessor accesses the texture cache via a texture unit that implements
the various addressing modes and data filtering mentioned in Section 3.2.4.
The local and global memory spaces reside in device memory and are not cached.
More precisely, the following protocol is used to determine the memory transactions
necessary to service all threads in a half-warp:
Find the memory segment that contains the address requested by the lowest
numbered active thread. The segment size depends on the size of the words
accessed by the threads:
32 bytes for 1-byte words,
64 bytes for 2-byte words,
128 bytes for 4-, 8- and 16-byte words.
Find all other active threads whose requested address lies in the same segment.
Reduce the transaction size, if possible:
If the transaction size is 128 bytes and only the lower or upper half is used,
reduce the transaction size to 64 bytes;
If the transaction size is 64 bytes (originally or after reduction from 128
bytes) and only the lower or upper half is used, reduce the transaction size
to 32 bytes.
Carry out the transaction and mark the serviced threads as inactive.
Repeat until all threads in the half-warp are serviced.
__shared__ int shared_lo[32];
__shared__ int shared_hi[32];
double dataIn;
shared_lo[BaseIndex + tid] = __double2loint(dataIn);
shared_hi[BaseIndex + tid] = __double2hiint(dataIn);
double dataOut =
    __hiloint2double(shared_hi[BaseIndex + tid],
                     shared_lo[BaseIndex + tid]);
This might not always improve performance, however, and does perform worse on
devices of compute capability 2.x.
The same applies to structure assignments. The following code, for example:
__shared__ struct type shared[32];
struct type data = shared[BaseIndex + tid];
results in:
Three separate reads without bank conflicts if type is defined as
struct type {
float x, y, z;
};
since each member is accessed with an odd stride of three 32-bit words;
Two separate reads with bank conflicts if type is defined as
struct type {
float x, y;
};
since each member is accessed with an even stride of two 32-bit words.
G.4.1 Architecture
For devices of compute capability 2.x, a multiprocessor consists of:
For devices of compute capability 2.0:
32 CUDA cores for integer and floating-point arithmetic operations,
4 special function units for single-precision floating-point transcendental
functions,
For devices of compute capability 2.1:
48 CUDA cores for integer and floating-point arithmetic operations,
8 special function units for single-precision floating-point transcendental
functions,
2 warp schedulers.
At every instruction issue time, each scheduler issues:
One instruction for devices of compute capability 2.0,
Two instructions for devices of compute capability 2.1,
for some warp that is ready to execute, if any. The first scheduler is in charge of the
warps with an odd ID and the second scheduler is in charge of the warps with an
even ID.
// Host code
// Runtime API
// cudaFuncCachePreferShared: shared memory is 48 KB
// cudaFuncCachePreferL1: shared memory is 16 KB
// cudaFuncCachePreferNone: no preference
cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared);
// Driver API
// CU_FUNC_CACHE_PREFER_SHARED: shared memory is 48 KB
// CU_FUNC_CACHE_PREFER_L1: shared memory is 16 KB
// CU_FUNC_CACHE_PREFER_NONE: no preference
CUfunction myKernel;
cuFuncSetCacheConfig(myKernel, CU_FUNC_CACHE_PREFER_SHARED);
The default cache configuration is "prefer none," meaning "no preference." If a
kernel is configured to have no preference, then it will default to the preference of
the current thread/context, which is set using
cudaThreadSetCacheConfig()/cuCtxSetCacheConfig() (see the
reference manual for details). If the current thread/context also has no preference
(which is again the default setting), then whichever cache configuration was most
recently used for any kernel will be the one that is used, unless a different cache
configuration is required to launch the kernel (e.g., due to shared memory
requirements). The initial configuration is 48KB of shared memory and 16KB of L1
cache.
Multiprocessors are grouped into Graphics Processor Clusters (GPCs). A GPC includes
four multiprocessors.
Each multiprocessor has a read-only texture cache to speed up reads from the
texture memory space, which resides in device memory. It accesses the texture cache
via a texture unit that implements the various addressing modes and data filtering
mentioned in Section 3.2.4.
Unlike for devices of compute capability 1.x, there are no bank conflicts for arrays
of doubles accessed as follows, for example:
__shared__ double shared[32];
double data = shared[BaseIndex + tid];
128-Bit Accesses
The majority of 128-bit accesses will cause 2-way bank conflicts, even if no two
threads in a quarter-warp access different addresses belonging to the same bank.
Therefore, to determine the ways of bank conflicts, one must add 1 to the
maximum number of threads in a quarter-warp that access different addresses
belonging to the same bank.
[Figure: examples of linear (strided) addressing of shared memory by the 32 threads
of a warp across 32 banks.]
Left: Linear addressing with a stride of one 32-bit word (no bank conflict).
Middle: Linear addressing with a stride of two 32-bit words (2-way bank conflicts).
Right: Linear addressing with a stride of three 32-bit words (no bank conflict).
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050
www.nvidia.com