Programming in Parallel With CUDA: A Practical Guide (Richard Ansorge)
CUDA is now the dominant language used for programming GPUs, one of the most
exciting hardware developments of recent decades. With CUDA, you can use a
desktop PC for work that would have previously required a large cluster of PCs or
access to an HPC facility. As a result, CUDA is increasingly important in scientific
and technical computing across the whole STEM community, from medical physics
and financial modelling to big data applications and beyond.
This unique book on CUDA draws on the author’s passion for and long experi-
ence of developing and using computers to acquire and analyse scientific data. The
result is an innovative text featuring a much richer set of examples than found in any
other comparable book on GPU computing. Much attention has been paid to the
C++ coding style, which is compact, elegant and efficient. A code base of examples
and supporting material is available online, which readers can build on for their own
projects.
Richard Ansorge
University Printing House, Cambridge CB2 8BS, United Kingdom
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
www.cambridge.org
Information on this title: www.cambridge.org/9781108479530
DOI: 10.1017/9781108855273
A catalogue record for this publication is available from the British Library.
Contents
5 Textures 142
5.1 Image Interpolation 143
5.2 GPU Textures 144
5.3 Image Rotation 146
5.4 The Lerp Function 147
5.5 Texture Hardware 151
5.6 Colour Images 156
5.7 Viewing Images 157
5.8 Affine Transformations of Volumetric Images 161
5.9 3D Image Registration 167
5.10 Image Registration Results 175
9 Scaling Up 293
9.1 GPU Selection 295
9.2 CUDA Unified Virtual Addressing (UVA) 298
9.3 Peer-to-Peer Access in CUDA 299
9.4 CUDA Zero-Copy Memory 301
9.5 Unified Memory (UM) 302
9.6 A Brief Introduction to MPI 313
List of Figures
Appendix Figures
A.1 ToolKit version 10.2 install directory on Windows 10 379
A.2 CUDA samples directory on Windows 10 380
D.1 Normal scalar and AVX2 eight-component vector multiplication 394
Tables
Appendix Tables
A.1 NVIDIA GPU generations, 2007–2021 375
A.2 NVIDIA GPUs from Kepler to Ampere 376
A.3 Evolution of the CUDA toolkit 378
B.1 Atomic functions 383
D.1 Evolution of the SIMD instruction set on Intel processors 394
E.1 Intrinsic types in C++ (for current Intel PCs) 404
G.1 The CX header files 411
G.2 IO functions supplied by cxbinio.h 416
G.3 Possible flags used in cudaTextureDesc 424
Examples
Appendix Examples
B.1 Use of atomicCAS to implement atomicAdd for ints 384
B.2 Use of atomicCAS to implement atomicAdd for floats 385
C.1 Build command generated by Visual Studio 387
D.1 Comparison of Intel ICC and VS compilers 395
D.2 Intel intrinsic functions for AVX2 397
Preface
This book has been primarily written for people who need lots of computing power,
including those engaged in scientific research who need this power to acquire, process,
analyse or model their data. People working with medical data who need to process ever-
larger data sets and more complicated image data are also likely to find this book helpful.
Complicated and demanding computations are something I have been doing for my entire
research career, firstly in experimental high-energy physics and more recently in various
applications of medical imaging. The advent of GPU computing is one of the most exciting
developments I have yet seen, and one reason for writing this book is to share that
excitement with readers.
It seems to be a corollary of Moore’s law that the demand for computing power increases
to always exceed what is currently available. Since the dawn of the PC age in the early
1980s, vendors have been providing supplementary cards to improve the speed of rendering
displays. These cards are now known as graphics processing units or GPUs, and, driven by
the demands of the PC gaming industry, they have become very powerful computing engines
in their own right. The arrival in 2007 of the NVIDIA CUDA Toolkit for writing software
that exploits the power of GPUs for scientific applications was a game changer. Suddenly we
got a step up in computing power by a factor of 100 instead of the usual doubling every
18 months or so. Since then, the power of GPUs has also continued to grow exponentially
over time, following and even exceeding Moore’s law. Thus, knowing how to program
GPUs is just as useful today as it was in 2007. In fact, today, if you want to engage with
high-performance computing (HPC) perhaps on world-class supercomputers, knowing how
to use GPUs is essential.
Up till about 2002 the exponential growth in PC computing power was largely due to
increasing clock speeds. However, since then, clock speeds have plateaued at around
3.5 GHz, but the number of cores in a CPU chip has steadily increased. Thus, parallel
programming, which uses many cooperating cores running simultaneously to share the
computing load for a single task, is now essential to get the benefit from modern hardware.
GPUs take parallel programming to the next level, allowing thousands or even millions of
parallel threads to cooperate in a calculation.
Scientific research is difficult and competitive, and available computing power is often a
limiting factor. Speeding up an important calculation by a factor of, say, 200 can be a game
changer. A running time of a week is reduced to less than one hour, allowing for same-day
analysis of results. A running time of one hour would be reduced to 18 seconds, allowing for
exploration of the parameter space of complex models. A running time of seconds is reduced
to milliseconds, allowing for interactive investigation of computer models. This book should
be particularly useful to individual researchers and small groups who can equip their own in-
house PCs with GPUs and get these benefits. Even groups with good access to large HPC
facilities would benefit from very rapid tools on their own desktop machine to explore
features of their results.
Of course, this book is also suitable for any reader interested in finding out more about
GPUs and parallel programming. Even if you already know a little about the subject, we
think you will find studying our coding style and choice of examples rewarding.
To be specific, this book is about programming NVIDIA GPUs in C++. I make no
apology for concentrating on a specific vendor’s products. Since 2007 NVIDIA have
become a dominant force in HPC and, more recently, also AI. This is due to both the cost-
effectiveness of their GPUs and, just as importantly, the elegance of the C++-like CUDA
language. I know that some scientific programming is still carried out in various dialects of
Fortran (including Fortran IV, a language I was very fond of in the early 1980s). But C++ is,
in my opinion, more expressive. Fans of Fortran may point out that there is a technical
problem with optimising C++ code that uses pointers, but that problem was overcome with
the introduction of the restrict keyword in C99. This keyword is also supported by
modern C++ compilers, and it is used in many of our examples.
The examples are one feature that distinguishes this book from other current books on
CUDA. Our examples have been carefully crafted from interesting real-world applications,
including physics and medical imaging, rather than the rather basic (and frankly boring)
problems often found elsewhere. Another difference between this book and others is that we
have taken a lot of care over the appearance of our code, using modern C++ where
appropriate, to reduce verbosity while retaining simplicity. I feel this is really important; in
my experience most scientific PhD students learn computing by modifying other people’s
code, and, while much of the CUDA example code currently circulating works, it is far from
elegant. This may be because in 2007 CUDA was launched as an extension to C, not C++,
and most of the original SDK examples were written in a verbose C style. It is unfortunate
that that style still persists in many of the online CUDA tutorials and books. The truth is that
CUDA always supported some C++, and nowadays CUDA fully supports up to C++17
(albeit with a few restrictions). In November 2019 the venerable “NVIDIA C Programmers
Guide” was renamed the “NVIDIA C++ Programmers Guide”, and, although then there was
no significant change to the content of the guide, it did signal a change in NVIDIA’s attitude
to their code, and since 2020 some more advanced uses of C++ have started to appear in the
SDK examples.
This book does not aim to teach you C++ from scratch; some basic knowledge of C++ is
assumed. However Appendix I discusses some of the C++ features used in our examples.
Modern C++ is actually something of a monster, with many newer features to support object-
orientated and other high-level programming styles. We do not use such features in this
book, as, in our view, they are not appropriate for implementing the algorithmic code we run
on GPUs. We also favour template functions over virtual functions.
To get the most out of our book, you will need access to a PC equipped with an NVIDIA
GPU supporting CUDA (many of them do). The examples were developed using a Windows
10 PC with a 4-core Intel CPU and an NVIDIA RTX 2070 GPU (costing £480 in 2019).
A Linux system is also fine, and all our examples should run without modification. Whatever
system you have, you will need a current version of the (free) NVIDIA CUDA Toolkit. On
Windows, you will also need Visual Studio C++ (the free community version is fine). On
Linux, gcc or g++ is fine.
Sadly, we cannot recommend CUDA development on macOS, since Apple do not use
NVIDIA cards on their hardware and their drivers do not support recent NVIDIA cards. In
addition, NVIDIA have dropped support for macOS starting with their Toolkit version 11.0,
released in May 2020.
All of the example code can be downloaded from https://github.com/RichardAns/CUDA-Programs.
This site will also contain errata for the inevitable bugs that some of you may find
in my code. By the way, I welcome reader feedback about bugs or any other comments. My
email address is rea1@cam.ac.uk. The site will be maintained, and I also hope to add some
additional examples from time to time.
I hope you enjoy reading my book as much as I have enjoyed writing it.
1 Introduction to GPU Kernels and Hardware
This book aims to teach you how to use graphics processing units (GPUs) and Compute
Unified Device Architecture (CUDA) to speed up your scientific or technical computing
tasks. We know from personal experience that the best way to learn to speak a new language
is to go to the relevant country and immerse yourself in the culture. Thus, we have chosen to
start our book with a complete working example of an interesting problem. We present three
versions of the code: firstly a standard C++ implementation for a single central processing
unit (CPU) thread; secondly a multithreaded CPU version suitable for running one or
two threads on each core of a multicore CPU, say between 4 and 16 threads in total. The third
version uses CUDA to run with thousands of simultaneous threads. We don’t expect readers
to immediately grasp all the nuances in the CUDA code – that is what the rest of this book is
for. Rather I hope you will see how similar the code is in all three versions and be
encouraged that GPU programming is not difficult and that it brings huge rewards.
After discussing these introductory examples, we go on to briefly recap the architecture of
traditional PCs and then introduce NVIDIA GPUs, covering both their hardware features
and the CUDA programming model.
1.1 Background
A modern PC processor now has two, four or more computing CPU cores. To get the best
from such hardware, your code has to be able to run in parallel on all the resources available.
In favourable cases, tools like OpenMP or the C++11 thread class defined in <thread>
allow you to launch cooperating threads on each of the hardware cores to get a potential
speed-up proportional to the number of cores. This approach can be extended to clusters of
PCs using communication tools like Message Passing Interface (MPI) to manage the inter-
PC communication. PC clusters are indeed now the dominant architecture in high-
performance computing (HPC). A cluster of at least 25 PCs with 8-core CPUs would be
needed to give a factor of 200 in performance. This is doable but expensive and incurs
significant power and management overheads.
An alternative is to equip your PC with a modern, reasonably high-specification GPU. The
examples in this book are based on an NVIDIA RTX 2070 GPU, which was bought for £480
in March 2019. With such a GPU and using NVIDIA’s C++-like CUDA language, speed-
ups of 200 and often much more can be obtained on a single PC with really quite modest
effort. An additional advantage of the GPU is that its internal memory is about 10 times
faster than that of a typical PC, which is extremely helpful for problems limited by memory
bandwidth rather than CPU power.
At the heart of any CUDA program are one or more kernel functions, which contain the
code that actually runs on the GPU. These kernel functions are written in standard C++ with
a small number of extensions and restrictions. We believe they offer an exceptionally clear
and elegant way of expressing the parallel content of your programs. This is why we have
chosen CUDA for this book on parallel programming. One feature that distinguishes the
book from other books on CUDA is that we have taken great care to provide interesting real-
world problems for our CUDA examples. We have also coded these examples using features
of modern C++ to write straightforward but elegant and compact code. Most of the
presently available online tutorials or textbooks on CUDA use examples heavily based on
those provided in the NVIDIA Software Development Kit (SDK). These
examples are excellent for demonstrating CUDA features but are mostly coded in a verbose,
outdated C style that often hides their underlying simplicity.1
To get the best from CUDA programs (and, indeed, any other programming language), it
is necessary to have a basic understanding of the underlying hardware, and that is the main
topic of this introductory chapter. But, before that, we start with an example of an actual
CUDA program; this is to give you a foretaste of what is to come – the details of the code
presented here are fully covered in later chapters.
02 #include <stdio.h>
03 #include <stdlib.h>
04 #include "cxtimers.h"
20 double pi = 3.14159265358979323;
21 double step_size = pi/(steps-1); // n-1 steps
22 cx::timer tim;
23 double cpu_sum = 0.0;
24 for(int step = 0; step < steps; step++){
25 float x = step_size*step;
26 cpu_sum += sinsum(x, terms); // sum of Taylor series
27 }
28 double cpu_time = tim.lap_ms(); // elapsed time
We will show three versions of this example. The first version, cpusum, is shown in
Example 1.1 and is written in straightforward C++ to run on a single thread on the host PC.
The second version, ompsum, shown in Example 1.2, adds two OpenMP directives to the
first version; these share the loop over steps between multiple CPU threads spread
equally across the host CPU cores. This illustrates the best we can do on a multicore PC
without using the GPU. The third version, gpusum, in Example 1.3 uses CUDA to share the
work between 10⁹ threads running on the GPU.
Note that the sinsum function uses 4-byte floats rather than 8-byte doubles because we
wish to make the best use of limited memory bandwidth and to improve calculation speed. For scientific work, the
final results rarely need to be accurate to more than a few parts in 10⁸ (a single bit error in an IEEE
4-byte float corresponds to a fractional error of 2⁻²⁴ or ~6 × 10⁻⁸). But, of course, we must be careful
that errors do not propagate as calculations progress; as a precaution the variable cpu_sum in the main
routine is an 8-byte double.
• Lines 2–4: Include standard headers; the header cxtimers.h is part of our cx utilities and
provides portable timers based on the C++11 <chrono> library.
• Lines 5–15: This is the sinsum function, which evaluates sin(x) using the standard Taylor series.
The value of x in radians is given by the first input argument x, and the number of terms to be used is
given by the second input argument terms.
• Lines 7–9: Initialise some working variables; term is the value of the current term in the Taylor
series, sum is the sum of the terms so far, and x2 is x².
• Lines 10–13: This is the heart of our calculation, with a loop where successive terms are calculated
in line 11 and added to sum in line 12. Note that line 11 is the single line where all the time-
consuming calculations happen.
The main function of the remaining code, in lines 16–35, is to organise the calculation in a
straightforward way.
• Lines 18–19: Set the parameters steps and terms from optional user input.
• Line 21: Set the step size required to cover the interval between 0 and π using steps steps.
• Line 22: Declare and start the timer tim.
• Lines 23–27: A for loop to call the function sinsum steps times while incrementing x to
cover the desired range. The results are accumulated in the double cpu_sum.
• Line 28: Store the elapsed (wall clock) time since line 22 in cpu_time. This member function
also resets the timer.
• Lines 30–31: To get the integral of sin(x), we perform end-point corrections to cpusum and scale by
step_size (i.e. dx).
• Line 31: Print the result, including the time in ms. Note that the result is accurate to nine significant figures
in spite of using floats in the function sinsum.
The example shows a typical command line launch requesting 10⁶ steps and 10³ terms in each step.
The result is accurate to nine significant figures. Lines 11 and 12 are executed 10⁹ times in 1.8 seconds,
equivalent to a few GFlops/sec.
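Since the listing above reproduces only a fragment of Example 1.1, the following is a minimal self-contained sketch of the whole cpusum program, reconstructed from the description above. It uses std::chrono directly in place of the book's cx::timer helper, and the argument handling and output format are illustrative rather than the book's exact code.

#include <cstdio>
#include <cstdlib>
#include <chrono>

// Taylor series for sin(x): x - x^3/3! + x^5/5! - ...
inline float sinsum(float x, int terms)
{
    float term = x;          // current term of the series
    float sum  = term;       // sum of the terms so far
    float x2   = x * x;      // x squared
    for (int n = 1; n < terms; n++) {
        term *= -x2 / (float)(2 * n * (2 * n + 1)); // next term from the previous one
        sum  += term;
    }
    return sum;
}

int main(int argc, char *argv[])
{
    int steps = (argc > 1) ? atoi(argv[1]) : 1000000; // optional user inputs
    int terms = (argc > 2) ? atoi(argv[2]) : 1000;

    double pi = 3.14159265358979323;
    double step_size = pi / (steps - 1);              // n-1 steps between n points

    auto t0 = std::chrono::high_resolution_clock::now();
    double cpu_sum = 0.0;
    for (int step = 0; step < steps; step++) {
        float x = step_size * step;
        cpu_sum += sinsum(x, terms);                  // sum of Taylor series
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    double cpu_time = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // Trapezoidal Rule end-point correction, then scale by dx
    cpu_sum -= 0.5 * (sinsum(0.0f, terms) + sinsum((float)pi, terms));
    cpu_sum *= step_size;
    printf("cpusum %.10f steps %d terms %d time %.3f ms\n",
           cpu_sum, steps, terms, cpu_time);
    return 0;
}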
In the second version, Example 1.2, we use the readily available OpenMP library to share
the calculation between several threads running simultaneously on the cores of our
host CPU.
02 #include <stdio.h>
03 #include <stdlib.h>
03.5 #include <omp.h>
04 #include "cxtimers.h"
• Line 3.5: An extra line to include the header file omp.h. This has all the necessary definitions
required to use OpenMP.
• Line 19.5: An extra line to add the user-settable variable threads, which sets the number of CPU
threads used by OpenMP.
• Line 23.5: This is actually just a function call that tells OpenMP how many parallel threads to use. If
omitted, the number of hardware cores is used as a default. This function can be called more than
once if you want to use different numbers of cores in different parts of your code. The variable
threads is used here.
• Line 23.6: This line sets up the parallel calculation. It is a compiler directive (or pragma) telling
the compiler that the immediately following for loop is to be split into a number of sub-loops, the
range of each sub-loop being an appropriate part of the total range. Each sub-loop is executed in
parallel on different CPU threads. For this to work, each sub-loop will get a separate set of the loop
variables, x and omp_sum (n.b.: We use omp_sum instead of cpu_sum in this section of the
code). The variable x is set on each pass through the loop with no dependencies on previous passes,
so parallel execution is not problematic. However, that is not the case for the variable omp_sum,
which is supposed to accumulate the sum of all the sin(x) values. This means the sub-loops have
to cooperate in some way. In fact, the operation of summing a large number of variables, held either
in an array or during loop execution, occurs frequently and is called a reduce operation. Reduce is an
example of a parallel primitive, which is a topic we discuss in detail in Chapter 2. The key point is
that the final sum does not depend on the order of the additions; thus, each sub-loop can accumulate
its own partial sum, and these partial sums can then be added together to calculate the final value of
the omp_sum variable after the parallel for. The last part of the pragma tells OpenMP that the
loop is indeed a reduction operation (using addition) on the variable omp_sum. OpenMP will add
the partial sums accumulated by each thread’s copy of omp_sum and place the final result into the
omp_sum variable in our code at the end of the loop. (A sketch of these changes appears after this list.)
• Line 32: Here we have simply modified the existing printf to also output the value of threads.
Two command line launches are shown at the end of this example, the first using four OMP threads
and the second using eight OMP threads.
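As a rough guide, the OpenMP changes described above amount to something like the following fragment, which could be dropped into the single-thread sketch given earlier. It assumes #include <omp.h> and that steps, terms and step_size are already set; the variable names follow the text, but the details may differ from the book's Example 1.2.

int threads = (argc > 3) ? atoi(argv[3]) : 4;  // line 19.5: user-settable number of CPU threads

omp_set_num_threads(threads);                  // line 23.5: tell OpenMP how many threads to use
double omp_sum = 0.0;
#pragma omp parallel for reduction(+:omp_sum)  // line 23.6: split the loop; reduce the partial sums
for (int step = 0; step < steps; step++) {
    float x = step_size * step;
    omp_sum += sinsum(x, terms);               // each thread accumulates its own copy of omp_sum
}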
The results of running ompsum on an Intel quad-core processor with hyper-threading are
shown at the bottom of the example using either four or eight threads. For eight threads the
speed-up is a factor of 3.8, which is a good return for little effort. Note that using eight threads
instead of four on our PC means running two threads on each core, which is supported by
Intel hyper-threading on this CPU; we see a modest extra gain but nothing like a factor of 2.
In Visual Studio C++, we also have to tell the compiler that we are using OpenMP using
the properties dialog, as shown in Figure 1.1.
In the third version, Example 1.3, we use a GPU and CUDA, and again we parallelise the
code by using multiple threads for the loop in lines 24–27, but this time we use a separate
thread for each iteration of the loop, a total of 10⁹ threads for the case shown here. The code
changes for the GPU computation are a bit more extensive than was required for OpenMP,
but as an incentive to continue reading, we will find that the speed-up is now a factor of
960 rather than 3.8! This dramatic gain is an example of why GPUs are routinely used in
HPC systems.
20 double pi = 3.14159265358979323;
21 double step_size = pi / (steps-1); // NB n-1
// allocate GPU buffer and get pointer
21.1 thrust::device_vector<float> dsums(steps);
21.2 float *dptr = thrust::raw_pointer_cast(&dsums[0]);
22 cx::timer tim;
22.1 gpu_sin<<<blocks,threads>>>(dptr,steps,terms,
(float)step_size);
22.2 double gpu_sum =
thrust::reduce(dsums.begin(),dsums.end());
28 double gpu_time = tim.lap_ms(); // get elapsed time
29 // Trapezoidal Rule Correction
30 gpu_sum -= 0.5*(sinsum(0.0f,terms)+sinsum(pi, terms));
31 gpu_sum *= step_size;
32 printf("gpusum %.10f steps %d terms %d time %.3f ms\n",gpu_sum,steps,terms,gpu_time);
33 return 0;
34 }
• Lines 1–4: These include statements are the same as in Example 1.1.
• Line 4.1: This is the standard include file needed for all CUDA programs. A simple CUDA program
just needs this, but there are others that will be introduced when needed.
• Line 4.2: This include file is part of the Thrust library and provides support for thrust vectors on the
GPU. Thrust vector objects are similar to the std::vector objects in C++, but note that CUDA
has separate classes for thrust vectors in CPU memory and in device memory.
• Lines 5–15: This is the same sinsum function used in Example 1.1; the only difference is that in
line 5 we have decorated the function declaration with __host__ and __device__, which tell
the compiler to make two versions of the function, one suitable for code running on the CPU (as
before) and one for code running on the GPU. This is a brilliant feature of CUDA: literally the same
code can be used on both the host and device, removing a major source of bugs.2
• Lines 15.1–15.8: These define the CUDA kernel function gpu_sin that replaces the loop over
steps in lines 24–27 of the original program. Whereas OpenMP uses a small number of host
threads, CUDA uses a very large number of GPU threads. In this case we use 10⁹ threads, a separate
thread for each value of step in the original for loop. Kernel functions are declared with the
keyword __global__ and are launched by the host code. Kernel functions can receive arguments
from the host but cannot return values – hence they must be declared as void. Arguments can either
be passed to kernels by value (good for single numbers) or as pointers to previously allocated device
memory. Arguments cannot be passed by reference, as in general the GPU cannot directly access
host memory. (A complete sketch of this kernel and its launch configuration, reconstructed from this
description, is shown after this list.)
Line 15.3 of the kernel function is especially noteworthy, as it encapsulates the essence of parallel
programming in both CUDA and MPI. You have to imagine that the code of the kernel function is
running simultaneously for all threads. Line 15.3 contains the magic formula used by each particular
instance of an executing thread to figure out which particular value of the index step it
needs to use. The details of this formula will be discussed later in Table 1.1. Line 15.4 is an out-of-
range check, necessary because the number of threads launched has been rounded up to a multiple
of 256.
• Lines 15.5 and 15.6 of the kernel: These correspond to the body of the for loop (i.e. lines 25–26 in
Example 1.1). One important difference is that the results are stored in parallel to a large array in the
global GPU memory, instead of being summed sequentially to a unique variable. This is a common
tactic used to avoid serial bottlenecks in parallel code.
• Lines 16–19 of main: These are identical to the corresponding lines in Example 1.1.
• Lines 19.1–19.2: Here we define two new variables, threads and blocks; we will meet these
variables in every CUDA program we write. NVIDIA GPUs process threads in blocks. Our
variables define the number of threads in each block (threads) and the number of thread blocks
(blocks). The value of threads should be a multiple of 32, and the number of blocks can be
very large.
• Lines 20–21: These are the same as in Example 1.1.
• Line 21.1: Here we allocate an array dsums in GPU memory of size steps. This works like
std::vector except we use the CUDA thrust class. The array will be initialised to zero on
the device.
• Line 21.2: Here we create a pointer dptr to the memory of the dsums vector. This is a suitable
argument for kernel functions.
• Lines 22.1–22.2: These two lines replace the for loop in lines 23–27 of Example 1.1, which called
sinsum steps times sequentially. Here line 22.1 launches the kernel gpu_sin, which uses
steps separate GPU threads to call sinsum for all the required x values in parallel. The individual
results are stored in the device array dsums. In line 22.2 we call the reduce function from the thrust
library to add all the values stored in dsums, and then copy the result from the GPU back to the host
variable gpu_sum.3
• Lines 28–34: These remaining lines are identical to Example 1.1; notice that the host version of our
sinsum function is used in line 30.
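Putting the fragments and line-by-line description above together, a complete gpusum program might look like the sketch below. The timing code is omitted for brevity, and the argument handling and output format are illustrative rather than the book's exact Example 1.3.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>            // line 4.1: basic CUDA support
#include <thrust/device_vector.h>    // line 4.2: thrust device_vector container
#include <thrust/reduce.h>

// lines 5-15: the same Taylor-series function, compiled for both host and device
__host__ __device__ inline float sinsum(float x, int terms)
{
    float term = x, sum = term, x2 = x * x;
    for (int n = 1; n < terms; n++) {
        term *= -x2 / (float)(2 * n * (2 * n + 1));
        sum  += term;
    }
    return sum;
}

// lines 15.1-15.8: one GPU thread per value of step
__global__ void gpu_sin(float *sums, int steps, int terms, float step_size)
{
    int step = blockIdx.x * blockDim.x + threadIdx.x; // line 15.3: unique thread rank in the grid
    if (step < steps) {                               // line 15.4: out-of-range check
        float x = step_size * step;                   // line 15.5
        sums[step] = sinsum(x, terms);                // line 15.6: store the result, summed later
    }
}

int main(int argc, char *argv[])
{
    int steps = (argc > 1) ? atoi(argv[1]) : 1000000;
    int terms = (argc > 2) ? atoi(argv[2]) : 1000;
    int threads = 256;                                 // line 19.1: thread block size
    int blocks  = (steps + threads - 1) / threads;     // line 19.2: round up so threads*blocks >= steps

    double pi = 3.14159265358979323;
    double step_size = pi / (steps - 1);

    thrust::device_vector<float> dsums(steps);          // line 21.1: device array, zero-initialised
    float *dptr = thrust::raw_pointer_cast(&dsums[0]);  // line 21.2: raw device pointer for the kernel

    gpu_sin<<<blocks, threads>>>(dptr, steps, terms, (float)step_size); // line 22.1
    double gpu_sum = thrust::reduce(dsums.begin(), dsums.end());        // line 22.2: reduce, then D2H copy

    gpu_sum -= 0.5 * (sinsum(0.0f, terms) + sinsum((float)pi, terms));  // trapezoidal correction
    gpu_sum *= step_size;
    printf("gpusum %.10f steps %d terms %d\n", gpu_sum, steps, terms);
    return 0;
}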
As a final comment we notice that the result from the CUDA version is a little less
accurate than either of the host versions. This is because the CUDA version uses 4-byte floats
throughout the calculation, including the final reduction step, whereas the host versions use
an 8-byte double to accumulate the final result sum over 10⁶ steps. Nevertheless, the CUDA
result is accurate to eight significant figures, which is more than enough for most
scientific applications.
The sinsum example is designed to require lots of calculation while needing very little
memory access. Since reading and writing to memory are typically much slower than
performing calculations, we expect both the host CPU and the GPU to perform at their best
efficiencies in this example. In Chapter 10, when we discuss profiling, we will see that the
GPU is delivering several TFlops/sec in the example. While the sinsum function used in this
example is not particularly interesting, the brute force integration method used here could be
used for any calculable function spanned by a suitable grid of points. Here we used 10⁹ points,
which is enough to sample a function on a 3D Cartesian grid with 1000 points along each of
the three coordinate axes. Being able to easily scale up to 3D versions of problems that can
only be reasonably done in 2D on a normal PC is another great reason to learn about CUDA.
In order to write effective programs for your GPU (or CPU), it is necessary to have some
feeling for the capabilities of the underlying hardware, and that is our next topic. So, after
this quick look at CUDA code and what it can do, it is time to go back to the beginning and
remind ourselves of the basics of computer hardware.
• Load/Save: This unit reads data from and sends data to the main memory. The unit is controlled
by the Execute logic, which at each step specifies whether data is to be read from or written to main memory.
The data in question is transferred to or from one of the registers in the register file.
• Register File: This is the heart of a CPU; data must be stored in one or more registers in
order to be operated on by the ALU.
• ALU or Arithmetic Logic Unit: This device performs arithmetic and logical operations on
data stored in registers; at each step the required operation is specified by the execute unit.
This is the piece of hardware that actually computes!
• Execute: This unit decodes the instruction sent by the instruction fetch unit, organises the
transfer of input data to the register file, instructs the ALU to perform the required operation
on the data and then finally organises the transfer of results back to memory.
• Fetch: The fetch unit fetches instructions from main memory and passes them to the
execute unit. The unit contains a register holding the program counter (PC)5 that contains
the address of the current instruction. The PC is normally incremented by one instruction
at each step so that the instructions are executed sequentially. However, if a branch is
required, the fetch unit changes the PC to point to the instruction at the branch point.
[Figure caption, partially recovered: since about 2002 clock frequencies have increased only
slowly, and recent growth has been due to innovations in design rather than increases in
frequency. The major contribution to performance per chip since 2002 has been from
multicore technology. GPUs are excluded from this plot, although recent Intel Xeon Phi
designs with hundreds of cores are becoming quite GPU-like. Remarkably, the power used
by a single device has also not increased since 2002. The data are from
https://github.com/karlrupp/microprocessor-trend-data.]
It is really worth taking a moment to contemplate the story told in this figure. The compute
power of individual devices has increased by a factor of more than 10⁶ in the last 30 years
and also the number of devices has grown at an even faster rate. Many people now own
numerous devices in their smartphones, laptops, cars and household gadgets while the
internet giants run vast server farms each of which must have a processor count running
into millions. Likewise, for scientific computing the most powerful systems in the recent
TOP500 list of supercomputers also have millions of processing cores. These developments,
particularly in the last 15 years or so, have transformed society – and the process has only
just begun; there is no sign that the trends shown in the figure are about to stop.
One hope I have in writing this book is that learning about GPUs and parallel programming
will be a small help in keeping up with the changes to come.
The caching scheme shown in Figure 1.4 is typical for modern CPU multi-core chips.
There are three levels of caching memories, all integrated into the CPU chip. First the level-3
(L3) cache is large at 8 MB and is shared by the 4 CPU cores. Each core also has
progressively faster L2 and L1 caches, with separate L1 caches for data and instructions.
The hardware transfers cache data in packets called cache lines of typically 64 or 128
bytes. On current Intel processors the cache line size is 64 bytes, but the hardware usually
transfers two adjacent lines at once giving an effective size of 128 bytes. The data in a cache
line corresponds to a contiguous region of the main memory beginning at an address that is a
multiple of the cache line size.
… themselves may differ in software features. The specific capability of a particular GPU is known as
its Compute Capability or CC, which is specified by a monotonically increasing number. The most
recent generation is Ampere with a CC value of 8.0. The examples in the book were developed
using a Turing RTX 2070 GPU with a CC of 7.5. A fuller account can be found in Appendix A.
… groups the cores into “warps” of 32 cores processed by a “Warp Engine” (WE) which adds
shared resources including IO, double precision units (either 1 or 16), and eight special
function units (SFUs). The SFUs evaluate 32-bit transcendental functions such as sin or exp.
All the threads in any particular CUDA warp run together on a single warp engine. Small
groups of WEs, usually two or four, are then grouped together into one SM unit. Finally,
multiple SMs are grouped together to make the GPU.
… much about explicitly using constant memory. Notice also that this memory is quite
limited and therefore not useful for large tables of parameters.
• Texture memory: This feature is directly related to the graphics processing origins of
GPUs. Texture memory is used to store arrays of up to three dimensions and is optimised
for local addressing of 2D arrays. They are read only and have their own dedicated caches.
Textures are accessed by the special lookup functions, tex1D, tex2D and tex3D. These
functions are capable of performing very fast 1D interpolation, 2D bilinear interpolation or
3D trilinear interpolation for arbitrary input positions within the texture. This is a massive
help for image processing tasks and I highly recommend using texture lookup wherever
helpful. A number of our examples will illustrate how to use textures.
Recent versions of CUDA support additional texture functionality, including layered textures
(indexable stacks of 1D or 2D textures) and surfaces which can be written to by the GPU.
• Local memory: These are memory blocks private to each individual executing thread; they
are used as overflow storage for local variables and intermediate temporary results when the
registers available to a thread are insufficient. The compiler handles the allocation and use of
this resource. Local memory is cached via the L2 and L1 caches just like other data.
• Register file: Each SM has 64K 32-bit registers which are shared equally by the thread
blocks concurrently executing on the SM. This can be regarded as a very important
memory resource. In fact there is a limit of 64 on the maximum number of concurrent
warps (equivalent to 2K threads) that can execute on a given SM. This means that if the
compiler allows a thread to use more than 32 registers, the maximum number of thread
blocks running on the SM (i.e. occupancy) is reduced, potentially harming performance.
The NVCC compiler has a switch, --maxrregcount <number>, that can be used to
tune overall performance by trading occupancy against thread computational performance.7
• Shared memory: Each SM provides between 32 KB and 64 KB of shared memory. If a
kernel requires shared memory the size required can be declared either at kernel launch
time or at compile time. Each concurrently executing thread block on an SM gets the same
size memory block. Thus if your kernel requires more than half of the maximum amount
of shared memory, the SM occupancy will be reduced to a single thread block per SM.
Realistic kernels would usually ask for no more than half that available memory.
Shared memory is important because it is very fast and because it provides the best way for
threads within a thread block to communicate with each other. Many of our examples feature
the use of shared memory.
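As an illustration of the two ways of providing shared memory just mentioned, the sketch below shows a compile-time (static) declaration and a launch-time (dynamic) declaration; the kernel names and the trivial work they do are hypothetical.

// Static shared memory: the size is fixed when the kernel is compiled.
__global__ void static_shared_example(float *out)
{
    __shared__ float ws[256];                       // one 256-float block per resident thread block
    ws[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();                                // make all writes visible to the whole block
    out[blockIdx.x * blockDim.x + threadIdx.x] = ws[threadIdx.x];
}

// Dynamic shared memory: the size is set by the third kernel launch parameter.
__global__ void dynamic_shared_example(float *out)
{
    extern __shared__ float ws[];                   // size chosen at launch time
    ws[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = ws[threadIdx.x];
}

// Launches (out must hold blocks*256 floats in device memory):
//   static_shared_example<<<blocks, 256>>>(out);
//   dynamic_shared_example<<<blocks, 256, 256 * sizeof(float)>>>(out);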
Many early CUDA examples emphasise the faster memory access provided by shared
memory compared to the poor performance of the (then poorly cached) main memory.
Recent GPUs have much better memory caching so using shared memory for faster access
is less important. It is important to balance the performance gain from using shared memory
against reduced SM occupancy when large amounts of shared memory are needed.
Current GPUs use their L1 and L2 caches together with high occupancy to effectively hide
the latency of main memory accesses. The caches work most effectively if the 32 threads in a
warp access 32-bit variables in up to 32 adjacent memory locations and the starting location
is aligned on a 32-word memory boundary. Using such memory addressing patterns is called
memory coalescing in CUDA documentation. Early documentation places great emphasis on
this topic because early GPUs had poor or no caching. Modern GPUs are more forgiving so
you probably just need to stick to the golden rule: adjacent threads (or indices on CPUs)
address adjacent memory locations. The memory arrangement is shown in Figure 1.6.
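The golden rule can be illustrated with a pair of hypothetical kernels: in the first, adjacent threads address adjacent memory locations, so each warp's 32 float accesses coalesce into a few wide transactions; in the second, a stride between threads breaks the pattern and forces many separate transactions.

__global__ void coalesced_copy(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;            // adjacent threads -> adjacent addresses
    if (i < n) dst[i] = src[i];                               // accesses within a warp coalesce
}

__global__ void strided_copy(float *dst, const float *src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride; // adjacent threads jump by stride floats
    if (i < n) dst[i] = src[i];                               // poor coalescing for stride > 1
}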
1.10 Warps and Waves
The GPU architecture is reflected in the way a CUDA kernel is designed and launched by
host software. Designing good kernels to match particular problems requires skill and
experience and is essentially what the rest of this book is about. When you get it right, it
is very satisfying to see your code speed up by a large factor. First you have to decide how
many threads, Nthreads, to use. Choosing a good value for Nthreads is one of the most
important choices you make when designing CUDA kernels. The choice of Nthreads is of
course problem specific. For our example for the gpusum program in Example 1.3 we used
a value of Nthreads equal to the number of steps. So for 10⁹ steps we used 10⁹ threads,
which is huge compared to the eight that can be run in parallel on our CPU. In later chapters
we will say a lot about image processing. To process a 2D image having nx × ny pixels, a
good choice is to put Nthreads = nx × ny. The key point is that Nthreads should be
big, arguably as big as possible.
If you are new to CUDA you might expect that setting Nthreads equal to Ncores, the
number of cores in your GPU, would be enough to keep the GPU fully occupied. In fact, this
is far from correct; one of the very neat features of NVIDIA GPUs is that the hardware can
hide the latencies of memory accesses or other hardware pipelines by rapidly switching
between threads, using the data for particular threads as soon as it becomes available.
To be specific, the GPU used for most of our examples is an RTX 2070, which has 36 SM
units (Nsm = 36) and each SM has hardware to process two warps of 32 threads (Nwarp =
2). Thus, for this GPU Ncores = Nsm × Nwarp × 32 = 2304. What is less obvious is
that during kernel processing each SM unit has a large number of resident threads, Nres; for
the RTX 2070, Nres = 1024, which is equivalent to 32 warps. At any given instant two of these
32 warps will be active and the remainder will be suspended, possibly waiting for pending
memory requests to be satisfied. This is how the latency hiding is implemented in NVIDIA
hardware. When we launch a kernel with say 109 threads these threads are run in waves of
Nwave = Nres × Nsm threads; this is actually 27127 waves on the 2070 GPU with
Nwave = 36864, with the last wave being incomplete. Ideally the minimum number of
threads in any kernel launch should be Nwave, and if more threads are possible it should be
a multiple of Nwave.
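As a quick check of the wave arithmetic above, the following hypothetical snippet reproduces the RTX 2070 numbers quoted in the text.

#include <cstdio>
int main()
{
    int Nsm = 36, Nwarp = 2, Nres = 1024;                 // RTX 2070 values from the text
    long long Nthreads = 1000000000LL;                    // 10^9 threads in the gpusum launch
    int Ncores = Nsm * Nwarp * 32;                        // 2304 cores
    int Nwave  = Nres * Nsm;                              // 36864 threads per wave
    long long waves = (Nthreads + Nwave - 1) / Nwave;     // 27127 waves, the last one incomplete
    printf("Ncores %d Nwave %d waves %lld\n", Ncores, Nwave, waves);
    return 0;
}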
Note that although the Turing generation of GPUs have Nres = 1024 this is unusual; all
the other recent NVIDIA GPU generations have Nres = 2048, twice the Turing value. Since
for these GPUs Nwarp = 2 is the same as for Turing, Nwave will be twice the Turing
value. Note that for any particular GPU generation Nsm will vary with model number, e.g.
Nsm = 46 for the RTX 2080 GPU, but Nwarp will not. Thus, Nwave will vary like Nsm
with GPU model for a given GPU generation.
Table 1.1 CUDA built-in variables. NB the first four are structs containing x, y and z members

  variable    comment
  threadIdx   id = threadIdx.x is the thread rank in its thread block;
              id = blockDim.x*blockIdx.x + threadIdx.x is the thread rank in the grid
  blockIdx    blockIdx.x is the block rank in the grid of blocks
  blockDim    blockDim.x is the number of threads in one block
  gridDim     gridDim.x is the number of blocks in the grid;
              threads = gridDim.x*blockDim.x is the total number of threads in the launch
  warpSize    the number of threads in a warp, set to 32 on all current GPUs
1.12 Occupancy
NVIDIA define occupancy as the ratio of the number of threads actually resident in the SM
units to the maximum value Nres. Occupancy is usually expressed as a percentage.
Full occupancy of 100 per cent is the same as saying that complete waves are running
on the SMs of the GPU.
Even if we launch a kernel with sufficient threads to achieve 100 per cent occupancy we
might not actually achieve full occupancy. The reason for this is that each SM has a limited
total shared memory size and a limited number of registers. If our thread block size is
256 then full occupancy will only be achieved if four (or eight) thread blocks are resident on
each SM, which reduces the resources available to each thread block by the same factor.
NVIDIA GPUs have enough registers for each thread to use up to 32 registers while
maintaining full occupancy. Shared memory is more difficult, as it is typically limited to
64 or 96 KB per SM, which is equivalent to only 32 or 48 bytes per thread at full occupancy
for non-Turing GPUs. On the latest Ampere GPUs this is increased to 80 bytes.
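The per-thread resource arithmetic quoted above can be reproduced with a few hypothetical lines (values for a non-Turing GPU with Nres = 2048 and 64 KB of shared memory per SM).

#include <cstdio>
int main()
{
    int Nres = 2048;                                  // resident threads per SM (non-Turing)
    int block_size = 256;
    int regs_per_sm = 64 * 1024;                      // 64K 32-bit registers per SM
    int smem_per_sm = 64 * 1024;                      // 64 KB shared memory per SM (device dependent)

    printf("resident blocks at full occupancy: %d\n", Nres / block_size);          // 8
    printf("registers per thread at full occupancy: %d\n", regs_per_sm / Nres);    // 32
    printf("shared bytes per thread at full occupancy: %d\n", smem_per_sm / Nres); // 32
    return 0;
}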
Less than full occupancy is not necessarily bad for performance, especially if the kernel is
compute bound rather than memory bound, but you may have to accept lower occupancy if your
kernel needs significant amounts of shared memory. Experimentation may be necessary in
these cases; using global memory instead of shared memory and relying on L1 caching for
speed may be a good compromise on modern GPUs.
Kernel code can use the built-in variables shown in Table 1.1 to determine the rank of a
thread in its thread block and in the overall grid. Only the 1D case is shown in this table; the
2D and 3D cases are discussed in the next chapter.
This is the end of our introductory chapter. In Chapter 2 we introduce the more general
ideas behind parallel programming on SIMD machines and GPUs. We then give some more
detailed examples, including the classic problem of parallel reduction. We also discuss
kernel launches in more detail.
Endnotes Chapter 1
1 For example, the Nvidia plugin for Visual Studio C++ helpfully generates a sample program to get you
started but unfortunately that program is full of goto statements. The use of goto has of course been
deprecated since the early 1980s.
2 Both the Nvidia documentation and their example code mostly refer to the CPU as the host and to the
GPU as the device. We often but not always follow this convention.
3 Later we will discuss the reduce operation in some detail as an example of a parallel primitive. Our best
CUDA code for this operation will turn out to be a bit faster than using thrust.
4 Computers which store instructions and data in a common memory are said to have von Neumann
architecture, after the physicist John von Neumann, who worked on the design of the EDVAC
machine in 1945. Arguably the idea of treating computer instructions as data can also be credited to Ada
Lovelace in the 1840s. The alternative Harvard architecture uses separate hardware to store data and
instructions. Examples include the Colossus computers, used at Bletchley Park from 1943, which were
programmed using switches and plugs. Paper tape could also be used to hold instructions. In a curious
case of nominative determinism, the English mathematician Max Newman played an important role in
the design of Colossus – truly these “new men” ushered in our digital age. Today Harvard architecture is
still used for specialised applications, e.g. embedded systems running fixed programs stored in read only
memory units (ROMs).
5 Unfortunately, the acronym PC for program counter is the same as for personal computer. Actually, the
former use predates the introduction of personal computers by at least 20 years, thus we will use PC for
both. This should not be confusing as we will rarely use PC for program counter.
6 Recently the Volta/Turing generation of GPUs launched in 2017 has relaxed this restriction somewhat.
This is discussed later as an advanced topic.
7 This depends on the compute capability of a particular device. Most devices can be configured to have at
least 48 KB.
2 Thinking and Coding in Parallel
Computers have always been very good at giving users the impression that they can perform
multiple tasks at the same time. For example, even a single core PC will allow you to browse
the web while running a lengthy calculation in the background. However, this is accom-
plished by fast task switching – the user gets CPU cycles when active, otherwise the CPU
cycles are given to the calculation. This is resource sharing not true parallel programming.
If you have a more recent 4-core PC, you might launch four instances of your lengthy
calculation with different parameter values to genuinely perform parallel computation without
any extra programming effort. Problems that can be solved with this approach are sometimes
called “trivially parallel”. In spite of the name, this approach is perfectly valid if it gets your
job done effectively and has the enormous advantage of requiring little extra programming
effort. If the potential number of jobs is large, a simple script file might be useful to automate
the launching of new jobs and collecting their results. A nice example is the CERN data centre,
which has more than 200,000 cores mostly running event data processing or Monte Carlo
simulation using the same program but different event data or different random numbers.
Unfortunately, the trivial programming approach does not work on GPUs which have very simple
processing cores designed to work together on a single task. True parallel programming requires just
this – many processing cores working together to complete a single task. It turns out that writing
effective parallel code is often rather straightforward – as hopefully we demonstrate in this book.
2.1 Flynn’s Taxonomy
In 1999 Intel added the SSE (Streaming SIMD Extensions) instruction set to the Pentium III architecture; these instructions could perform operations
on vectors of four 32-bit floating point numbers using what were effectively 128-bit
registers. Over the years the capabilities of Intel SIMD operations have increased; currently
Intel supports advanced vector extensions using 512-bit registers (AVX-512), enough for up
to sixteen 32-bit numbers. We discuss this topic in more detail in Appendix D.
The third, MIMD, is effectively just a set of separate CPUs performing separate tasks. This
case includes both modern multicore PCs running Linux or Windows and clusters of PCs
connected by a network. In both cases suitable software, for example, MPI or OpenMP, can be
used to allow the multiple independent processors to work together on a single computing task.
The fourth, MISD, is included for the sake of completeness and is rarely used. It might be used
in specialised embedded systems requiring redundancy against failure; for example, satellites.
The final, SIMT, was introduced by NVIDIA as a variation of SIMD to describe their GPU
architecture. Although both are used to tackle similar scientific computations, there are differences
between them. In the SIMD model a relatively small number of threads use vector hardware to
process data. In the SIMT model a large number of threads are used to process individual data
items. If a common instruction is used by all threads then SIMD behaviour is replicated, but the
SIMT architecture also permits threads to perform divergent operations which, while it may lead to
a drop in performance, also allows for more versatile code. In the recent Volta and Turing
generations of GPU, NVIDIA have extended the capabilities for programming individual threads.
It is the SIMD/T case that is of interest for parallel programming. We look for sections of
our code where the same operation is performed on multiple data items – obvious candidates
are for loops. However, if we want to share the computation of a loop across multiple
threads it is important that there are no dependencies between passes through the loop; for
example, the order in which the loop traversals are executed should not matter.
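The following hypothetical fragment illustrates the point: the first loop has no dependencies between passes and can be shared between threads, while the second cannot be split naively because each pass uses the result of the previous one.

void independent(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];   // any execution order gives the same result
}

void dependent(float *a, int n)
{
    for (int i = 1; i < n; i++) a[i] += a[i - 1];     // each pass needs the previous pass's result
}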
Consider again the loop in Example 1.1.
The loop statements shown in lines 25–26 are independent of other passes through the
loop, because the order in which we sum the values in the variable cpu_sum does not affect
the final result.2 Thus, this loop is a good candidate for parallel code, particularly so if the
evaluation of the function sinsum is computationally expensive. Before we proceed
there is a subtle technical problem to resolve. Either the variable cpu_sum must be
global – and therefore visible to all the threads participating in the parallel calculation – or some
other means must be found to get the correct final sum.
Making cpu_sum global to all threads, while straightforward to implement, introduces
yet another complication – if two or more threads try to update the variable simultaneously
the result will be undefined! With CUDA, one thread will succeed and the attempts by other
threads at simultaneous update will be ignored, so the final answer will be wrong. There is a
fix for this problem, which is actually a generic issue for all parallel computing platforms; it
is to use so-called Atomic operations to perform the required operation serially. Atomic
operations are usually implemented by calling platform specific functions, and their use in
CUDA code is discussed in Appendix B. For now we note that using atomics might slow
down a calculation and we choose an alternative approach, which is to simply store the
individual values returned by the sinsum function in separate elements of a large array. The
elements of the array will be summed together in a separate step once they have all been
calculated. This is an example of parallel thinking; we separate a serial loop into a part that
can be done in parallel – calling the sinsum function many times – and a part which cannot be done in
parallel – the reduce operation of adding up all the individual stored values. These two steps
are the only parts of the calculation that will be done on the GPU; code running on the CPU
takes care of everything else.
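The two approaches can be sketched as follows. The first kernel, gpu_sin_atomic, is hypothetical and shown only for contrast; the second is the store-then-reduce form actually used in Example 1.3. Both assume the __device__ version of sinsum from that example.

__device__ float sinsum(float x, int terms);   // defined as in Example 1.3

// Hypothetical alternative: every thread updates one global sum with an atomic operation.
__global__ void gpu_sin_atomic(float *sum, int steps, int terms, float step_size)
{
    int step = blockIdx.x * blockDim.x + threadIdx.x;
    if (step < steps) atomicAdd(sum, sinsum(step_size * step, terms));  // updates are serialised
}

// The approach used in Example 1.3: independent stores, followed by a separate reduce step.
__global__ void gpu_sin(float *sums, int steps, int terms, float step_size)
{
    int step = blockIdx.x * blockDim.x + threadIdx.x;
    if (step < steps) sums[step] = sinsum(step_size * step, terms);     // no contention between threads
}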
Now it is time to look in more detail at our first CUDA program emphasising the
steps necessary to convert the serial version in Example 1.1 to the parallel version in
Example 1.3.
The first step is to add the header files needed for CUDA.
The headers added are cuda_runtime.h, which provides basic support for CUDA, and
thrust/device_vector.h, which provides a container class like std::vector for 1D
arrays in GPU memory. Most of the examples in this book use these headers.
We then convert the function sinsum into a function that can be used on both the CPU
and the GPU. This is simply done by adding a CUDA keyword to the function declaration.
The keyword __device__ tells the compiler to compile a version of the function that runs
on the GPU and can be called by kernels and other functions running on the GPU. Likewise,
__host__ tells the compiler to create a version of the function for the CPU code to use. The
inline keyword is part of standard C++ and tells the compiler to generate function code
embedded in the caller’s code, removing the overhead of a function call at the price of
increasing the size of the final exe file. In CUDA inline is the default for __device__
functions. The __host__ keyword is only needed if the function is to be used on both the
host and device; for device only functions just __device__ is needed. The entire body of the
function in lines 6–15 is unchanged. This is a very powerful feature of CUDA.
It is also possible to have two different versions of a function, one declared with
__device__ and one declared with __host__. The __host__ prefix could be omitted
from the host version as this is the default, but we recommend using it to make your
intentions clear. Obviously this prefix is not needed (or recommended) for functions which
are only used by the host.
The __device__ version of sinsum is simply a GPU function and is not callable
directly from the host. We need to write a separate CUDA kernel function which runs on the
GPU and can be called from the host. CUDA kernels are declared using __global__
instead of __device__; this reflects their dual nature – callable by the host but running on
the GPU. In the CUDA world people talk about “launching” kernels rather than “calling”
them, so that is what we shall do from now on. Our first kernel gpu_sin in lines 15.1–15.8
is all new code which replaces most of lines 23–27 in the original program.
The kernel declaration in line 15.1 looks very much like a normal C++ declaration
except for the prefix __global__. There are, however, some restrictions based on the fact
that although the kernel is called from the host it cannot access any memory on the host. All
kernels must be declared void and their arguments are restricted to scalar items or pointers
to previously allocated regions of device memory. All kernel arguments are passed by value.
In particular, references are not allowed. It is not a good idea to try and pass large C++
objects to kernels; this is because they will be passed by value and there may be significant
copying overheads. Also any changes made by the kernel will not be reflected back in the
host’s copy after the kernel call. Additionally, any C++ classes or structs passed to a kernel
must have __device__ versions of all their member functions.
• Line 15.3 declares a variable step equivalent to the for loop index variable of the same
name in line 24 of Example 1.1. It is set to a value defined by the built-in variables
blockDim.x, blockIdx.x and threadIdx.x. The values of these variables depend
on the launch parameters used in the host call to the kernel as follows:
○ blockDim.x will be set to threads, i.e. the thread block size used by the
kernel.
○ blockIdx.x will be set to the rank of the thread block to which the current thread
belongs and will be in the range [0,blocks-1].
○ threadIdx.x will be set to the rank of the current thread within its thread block and
will be in the range [0,threads-1].
○ step = blockDim.x*blockIdx.x+threadIdx.x is in the range [0, threads × blocks - 1].
The key point is that the system will run threads × blocks instances of the kernel on
the GPU covering all possible combinations of values for threadIdx and blockIdx.
Thus, when we look at a kernel listing we must imagine that we are looking at the contents
of a loop which is executed for all possible values of these built-in variables. In this
case step takes all values in the range [0,size-1] where size = threads ×
blocks. When looking at a kernel code, you must imagine that the code is being run
simultaneously by all the threads. Once you have mastered this concept, you will be a
parallel programmer!3
• Line 15.4: This is an out-of-range check on the value of step, the kernel will exit at this
point for threads that fail the check.
• Line 15.5: Calculate the x value corresponding to step.
• Line 15.6: Call sinsum with the thread-dependent value of x. The result is stored in the
array sums using step as an index.
• Line 15.7: The kernel exits here; recall that return statements are not required for
void functions in C++.
• Lines 19.1–19.2: The two lines are added to define the kernel launch configuration
parameters threads and blocks. In this our first example, we use a fixed value
of 256 for threads and a calculated value for blocks which is set to be just big enough
to get the total number of threads in the launch to satisfy threads × blocks ≥
steps.
• Line 21.1: This line creates the array dsums of size steps in the device memory using
the thrust device_vector class as a container. By default the array will be initialised to
zeros on the device. This array is used by the gpu_sin kernel to hold the individual
values returned by calls to the sinsum function.
• Line 21.2: We cannot pass dsums to the kernel directly as thrust was not designed
to make this possible,4 but we can pass a pointer to the memory array managed by the
class. For std::vector objects, the member function data() does this job. While
this function does work for thrust host_vector objects it does not work for
device_vector objects. Therefore we have to use the more complicated cast shown
in this line. As an alternative you could instead use the undocumented data().get()
member function of device_vectors.
22.1 gpu_sin<<<blocks,threads>>>
(dptr,steps,terms,(float)step_size);
• Line 22.1: This line shows our first CUDA kernel launch; this is basically just a function call
with a weird extra bit <<<blocks, threads>>> inserted between the function name
and its argument list. The int variables that appear here specify the number of threads in
each thread block (threads) and the number of thread blocks (blocks). Note these
values can be defined at run time and if you have multiple kernel launches in your code each
launch can use different values. As discussed above, threads should be a multiple of
32 and has a maximum allowed value of 1024 for all current GPUs.5 The second parameter
blocks should be large. These values affect the performance of the kernel and “tuning”
them to get the best performance involves trying different combinations and choosing the
combination that runs the kernel fastest. To aid this in most of our subsequent example code
we will make these launch parameters user settable command line parameters. For most
kernels a good starting point is <<<4*Nsm, 256>>> where Nsm is the number of SMs on
the target GPU.6 In this book we often use <<<288, 256>>> as our GPU has 36 SM
units. For testing or debugging purposes using <<<1, 1>>> is sometimes interesting and
is allowed. That version has the effect of running just a single thread on one SM unit.
• Line 22.2: Here we use the host-callable reduce function in the thrust library to sum all the elements of the array dsums in GPU memory. This call involves two steps: first the required additions are performed on the GPU, and then the result is copied from GPU memory to CPU memory. This is often referred to as a D2H (device to host) transfer. A short sketch of this pattern is shown below.
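To make the thrust-related steps concrete, the following minimal sketch shows the same pattern in isolation: a device_vector is created, a raw device pointer to its data is passed to a kernel, and thrust::reduce then sums the array on the GPU and returns the result to the host. The trivial fill kernel here is only a stand-in for gpu_sin, and thrust::raw_pointer_cast is one way of obtaining the raw pointer discussed in the note on line 21.2.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

// stand-in for gpu_sin: one value per thread, stored via a raw pointer
__global__ void fill(float *sums, int steps, float step_size)
{
    int step = blockIdx.x*blockDim.x + threadIdx.x;
    if(step < steps) sums[step] = sinf(step*step_size);
}

int main()
{
    int steps = 1 << 20;
    int threads = 256;
    int blocks = (steps + threads - 1)/threads;
    float step_size = 3.141592653f/(steps - 1);

    thrust::device_vector<float> dsums(steps);              // zero initialised on the device
    float *dptr = thrust::raw_pointer_cast(dsums.data());   // raw pointer for the kernel

    fill<<<blocks, threads>>>(dptr, steps, step_size);

    // step 1: additions performed on the GPU, step 2: single-word D2H copy of the result
    double gpu_sum = thrust::reduce(dsums.begin(), dsums.end(), 0.0);
    printf("gpu_sum = %.6f\n", gpu_sum);
    return 0;
}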
That is the end of our detailed description of our first kernel. We have deliberately kept this
code as simple as possible.
It is worth looking at line 15.1 of the code in more detail. For any particular thread at line
15.1 of the kernel function the CUDA system variable blockIdx.x is set to the number of
the currently executing thread block and the variable threadIdx.x is set to the rank
of that current thread within its thread block. Thus, in the present case where steps is set to 10^9 and threads is set to 256, blocks will be set to 3906251 and blockIdx.x will be in the range [0,3906250], threadIdx.x will be in [0,255] and blockDim.x will be 256. Thus, for each thread the calculation in line 15.1 produces a unique value for step in the range [0, 10^9 − 1]. This is exactly what we need to replicate the behaviour of the original for
loop from Example 1.1. You will find something like line 15.1 in every CUDA kernel
function – it allows each thread to determine what specific task it has to do. In fact, the
variable step is nothing more than a unique thread id; this is often referred to as the rank of
the thread. NVIDIA also use the term lane to refer to the rank of a thread within its particular
32-thread warp.
There is another important point to make about line 15.1. The GPU hardware allocates
all the threads in any particular thread block to a single SM unit on the GPU, and these
threads are run together very tightly on warp-engines as warps of 32 threads. The variable
threadIdx.x is set so that threads in the same warp have consecutive values of
this variable; specifically threadIdx.x%32 is the rank or lane of a thread within its
warp (range 0–31) and threadIdx.x/32 is the rank of the warp within the thread
block (range 0–7 in our case). Thus, in line 15.6 of the kernel where we store a value in
sums[step], the adjacent threads within a given warp have adjacent values of step and
so they will address adjacent memory locations in the array sums. This is vital to make
efficient use of the GPU memory caching. Had we not known this we might have used the
formula:
step = threadIdx.x*gridDim.x+blockIdx.x;
in line 15.1. Since the variable gridDim.x is set to the number of thread blocks the
alternative would have given us the same range of values for the variable step but now
adjacent hardware threads have values of step separated by 3906251 – resulting in a
serious loss of memory performance.
In case you were wondering about the .x decoration, these built-in variables are all structs with three unsigned integer members x, y and z defined by the CUDA SDK (in vector_types.h); threadIdx and blockIdx have type uint3 while blockDim and gridDim have type dim3. In the next example we will see how they are used in 3D grids of threads to best fit the needs of a particular problem. The CUDA SDK defines a number of structs like this with up to four elements. The practical difference between the two types is that uint3 is a plain struct, whereas dim3 has constructors that initialise all unspecified components to one so that it is always safe to use all components of the variables
even if the user has not explicitly set them. Although we used simple integers for our kernel
launch, the compiler will silently and safely promote them to type dim3 for the actual
kernel launch.
The values of threads and blocks are defined in lines 19.1 and 19.2 of the example to ensure
that there are at least steps threads so that we use a separate thread for each call to sinsum.
This may well be optimal here because the sinsum function does a great deal of computation.
But this code gives the user no chance to experiment to find out if that is true. A more general
approach is to allow the user to specify values for threads and blocks and to modify the gpu_sin kernel so that individual threads can make more than one call to sinsum if necessary. Both modifications are very straightforward, as shown in Example 2.1.
. . .
15.1 __global__ void gpu_sin(float *sums, int steps,
int terms, float step_size)
15.2 {
15.3 int step = blockIdx.x*blockDim.x+threadIdx.x; // ID
15.4 while(step<steps){
15.5 float x = step_size*step;
15.6 sums[step] = sinsum(x,terms); // store value
15.65 step += gridDim.x*blockDim.x; // grid size stride
15.7 }
15.8 }
. . . // NB ternary operator (test) ? a : b used here
19.1 int threads = (argc > 3) ? atoi(argv[3]) : 256;
19.2 int blocks = (argc > 4) ? atoi(argv[4]) :
(steps+threads-1)/threads;
. . .
Our modifications to the kernel are to change the if in line 15.4 to a while and to insert an extra line 15.65 at the end of the while loop body. In line 15.65 we increment step by the total number of threads in the grid of thread blocks. The while loop continues until all steps values have been calculated, for any (non-zero) user-supplied values of blocks and
threads. Moreover, and importantly for performance reasons, on each pass through the
while loop adjacent threads always address adjacent memory locations. Other ways of
traversing through the data could be devised but the one shown here is the simplest and best.
This technique of using a while loop with indices having a grid-size stride between passes
through the loop is called “thread-linear addressing” and is common in CUDA code. It
should always be considered as an option when porting a loop in host code to CUDA.
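As a simple illustration of porting a host loop to a thread-linear CUDA kernel, the sketch below converts an ordinary saxpy-style loop into a grid-size-stride loop. The kernel and variable names are invented for the illustration, and managed memory is used only to keep the host code short.

#include <cstdio>
#include <cuda_runtime.h>

// host loop being ported: for(int i=0; i<n; i++) y[i] = a*x[i] + y[i];
__global__ void saxpy(float *y, const float *x, float a, int n)
{
    // thread-linear addressing with a grid-size stride between passes
    for(int i = blockIdx.x*blockDim.x + threadIdx.x; i < n;
            i += gridDim.x*blockDim.x) y[i] = a*x[i] + y[i];
}

int main()
{
    int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n*sizeof(float));
    cudaMallocManaged(&y, n*sizeof(float));
    for(int i = 0; i < n; i++){ x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<288, 256>>>(y, x, 3.0f, n);   // any non-zero blocks and threads will work
    cudaDeviceSynchronize();

    printf("y[0] = %.1f (expect 5.0)\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}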
The added lines 19.1 and 19.2 of the main routine now also use the C/C++ ternary operator (?:) and set the values of threads and blocks according to whether or not the user
has supplied extra command line arguments. If the user does not specify these arguments
then argc will be set to three or less and both tests will fail so both default values (after
the :) will be used. If the user specifies just one extra argument argc will be set to four so the
first test will succeed and threads will be set using the expression before the : which will
be the user supplied value. The second test will still fail and a default value for blocks will
be used in the calculation. Finally, if the user supplies both extra arguments then both
threads and blocks will be set using the user’s values.
We confess to being ambivalent about the (?:) operator; it is very terse and was
introduced in the early days of C in the 1970s, when it was desirable to minimise keystrokes on the heavy mechanical teletype machines then used to input code. Careless use of this operator
can make code hard to read. However, crucially it returns a value whereas if statements do
not. Using (?:) allows us to declare and initialise a variable in the same statement which is
in line with modern C++ RAII practices. In our view this trumps the terse syntax of the
operator. We do use it in our examples when initialising a variable to one of two alternatives.
One drawback of this approach to reading command line arguments for setting program
options is that the user has to know the order in which the options are defined and cannot set
a given option without also specifying all the previous options. For production code we
would of course recommend something better.
For thread-linear addressing it is possible to replace the while loop of Example 2.1 with a for loop, as shown in Example 2.2.
. . .
15.1 __global__ void gpu_sin(float *sums, int steps,
int terms, float step_size)
15.2 {
15.3 for(int step = blockIdx.x*blockDim.x+threadIdx.x;
step<steps; step += gridDim.x*blockDim.x){
15.5 float x = step_size*step;
15.6 sums[step] = sinsum(x,terms); // store value
15.7 }
15.8 }
Feel free to use either version but for me the first version using while seems clearer;
there is so much going on in the for statement in the second version that I find the code
harder to follow. A good compiler will generate identical code from either version.
01 #include "cuda_runtime.h"
02 #include "device_launch_parameters.h"
03 #include <stdio.h>
04 #include <stdlib.h>
07 __global__ void grid3D(int nx, int ny, int nz, int id)
08 {
09 int x = blockIdx.x*blockDim.x+threadIdx.x; // find
10 int y = blockIdx.y*blockDim.y+threadIdx.y; // (x,y,z)
11 int z = blockIdx.z*blockDim.z+threadIdx.z;
12 if(x >=nx || y >=ny || z >=nz) return; // range check
Using a global declaration is actually an easy way to create GPU arrays of known size, but we will
rarely use it – there are two important disadvantages. Firstly, the array dimensions must be set at
compile time not run time, and secondly declaring variables with file scope is a deeply deprecated programming style because it leads to unstructured code where functions can easily cause unwanted
side effects. In our subsequent examples we will allocate arrays in code and then pass them as pointer
arguments to called functions as necessary.
• Line 7: The kernel grid3D is declared with four arguments which are the array dimensions and id
which specifies the thread whose information will be printed.
• Lines 9–11: Here we calculate the thread's global x, y and z coordinates within the grid. The launch
parameters defined in lines 35–36 set the block dimensions to 32, 8 and 2 and the grid dimensions to
16, 64 and 128 for x, y and z respectively. This means that in line 9 the built-in variables
blockDim.x and gridDim.x are set to 32 and 16 respectively. Thus threadIdx.x and blockIdx.x will have ranges [0,31] and [0,15] and the desired coordinate x will have the range [0,511] which is required to index the global arrays a and b. Similarly, y and z have ranges
of [0,511] and [0,255]. Within any particular thread block the threadIdx values will have
ranges of [0,31], [0,7] and [0,1] for x, y and z; note the x range corresponds to one
complete warp of threads; this is a design choice, not chance. Having decided to use an x range of
32 we are restricted to smaller ranges for y and z as the product of all three is the thread block size
which is limited by hardware to a maximum of 1024.
• Line 12: This is an out-of-range check on the calculated indices. This check is not strictly necessary
here as we have carefully crafted the launch parameters to exactly fit the array dimensions. In
general, this will not always be possible and it is good practice to always include range checks in
kernel code.
• Lines 13–19: Calculate some values derived from the launch parameters. Most of these values
would not be needed in a real-world problem, but we want to print them to illustrate the detail of 3D
addressing in kernels.
○ Lines 13–15: The 3D array, thread block and grid sizes are simply calculated as the product of
their dimensions.
○ Line 16: Similarly, the total number of threads is the product of the thread block size and the grid size.
3D Rank Formula
rank = (z*dim_y + y)*dim_x + x
for a 3D array of dimensions (dim_x, dim_y, dim_z) laid out sequentially in memory with the x values adjacent, the y values separated by a stride of dim_x and the z values separated by a stride of dim_x*dim_y. We will use versions of this formula very often in our examples, often encapsulated in a lambda function.
○ Line 18: Here we also use the rank formula to calculate the rank of the thread block within the grid
of thread blocks.
○ Line 19: Here we use the 2D version of the rank formula to calculate the rank of the thread within its thread block. The remaining lines of the kernel simply print all these quantities for the selected thread.
• Lines 32–39: Here is the complete short main routine. Basically, we get a user settable value for id
in line 34, set the kernel launch parameters in lines 35–36 and launch the kernel in line 37.
The results of running Example 2.3 are shown in the box below. There are 2 cases
shown.
• Case id=511: This is the last thread in the first block which spans the range: [0-31,0-7,
0-1] and the last point in this range is (31,7,1) which is shown correctly as the index
[1][7][31] in the figure.
• Case id=1234567: To understand this we need to realise that a set of 16 blocks will span
the complete x range for eight consecutive y and two consecutive z values. Hence the first
1024 blocks will span the range [0-511,0-511,0-1], which is two complete x-y slices of the array. The next 1024 blocks will span the slices with z in range [2-3] and so on. Since 1234567 = 512*2411+135 we have picked thread 135 in the 2412th block. The first 4 x-y slices account for 2048 blocks, so our pick is in the 364th block in the z = 4–5 slice pair. Next, since 364 = 22*16 + 12, we conclude that our thread is in the 12th block in a set of 16 blocks that spans the index range [0-511,176-183,4-5]. This 12th block spans [352-383,176-183,4-5] and since thread 135 is offset by [7,4,0] from the block's origin we find an index set of [359,180,4] or a C/C++ 3D vector index address of [4][180][359].
As our second case illustrates 3D thread blocks are somewhat complicated to visualise but
their unique selling point is that they group threads spanning 3D subregions of the array into
a single SM unit where the threads can cooperate. In many volume processing applications,
for example, automatic anatomical segmentation of 3D MRI scans, this is a key advantage.
In practice, addressing such a subregion directly from the GPU main memory is often
inefficient due to the large strides between successive y and z values. In such cases caching
a 3D subregion in shared memory on the SM is an important optimisation.
However, if threads in your kernel only process individual elements of the array with little
collaboration between threads then 1D thread-linear address is simpler to implement and
offers more scope for tuning the launch configuration. Example 2.4 shows a version of the
grid3D kernel with 1D thread-linear addressing.
01 #include "cuda_runtime.h"
02 #include "device_launch_parameters.h"
03 #include <stdio.h>
04 #include <stdlib.h>
07 __global__ void grid3D_linear(int nx, int ny, int nz, int id)
08 {
09 int tid = blockIdx.x*blockDim.x+threadIdx.x;
Results from Example 2.4 using the grid3D_linear kernel to process 3D arrays with thread-linear addressing. The displayed array element has different 3D indices compared to Example 2.3, even though its linear index is the same as used in that example.
Calculation of 3D coordinates by extraction of bit fields from the thread-linear address. The two 9-bit masks are for the case where both nx and ny are equal to 512.
The formulae used to convert between a 3D (x,y,z) index triad and a linear index are shown in the
box. Lines 15–17 here are an example of this:
x = index % nx
y = (index / nx) % ny
z = index / (nx*ny)
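Both directions of this mapping can be wrapped in small helper lambdas, as the text suggests. The sketch below (with illustrative names) shows the rank formula and its inverse for the 512 × 512 × 256 array used in these examples:

#include <cstdio>

int main()
{
    const int nx = 512, ny = 512, nz = 256;

    // 3D rank formula: x is the fastest varying index
    auto idx = [=](int z, int y, int x){ return (z*ny + y)*nx + x; };

    // inverse mapping from a linear index back to (x,y,z)
    auto xof = [=](int i){ return i % nx; };
    auto yof = [=](int i){ return (i / nx) % ny; };
    auto zof = [=](int i){ return i / (nx*ny); };

    int i = idx(4, 180, 359);                      // element [4][180][359]
    printf("linear %d -> (%d,%d,%d)\n", i, xof(i), yof(i), zof(i));
    return 0;
}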
• Lines 19–28: This is similar to the previous example except we are using the variable name tid
instead of thread_rank_in_grid.
• Line 29: Here we increment tid using a stride equal to the length of the entire thread grid.
• Line 30: Here we increment a counter pass and continue to the next pass of the while loop. The
variable pass is only used as part of the information printed. The actual linear address being used
by a given tid within the while loop is rank_in_grid+pass*total_threads.
• Lines 33–40: The main routine now accepts two additional user arguments blocks and
threads, which define the kernel launch parameters.
The results for the thread with a linear index of 1234567, the same value as used in Example 2.3, show that this linear index corresponds to the 3D element [4][363][135], whereas in Example 2.3, which used a 3D grid and 3D thread blocks, it corresponded to the element [4][180][359]. Neither result is "wrong". The
difference merely reflects the different order in which elements of the arrays are encountered.
Next we return to the CUDA topic of occupancy.
Table 2.2 Occupancy limits for a Pascal GPU with 20 SMs

Thread Block Size | Blocks per SM | Blocks per Grid if GPU has 20 SMs | Registers per Thread | Bytes of Shared Memory per Thread Block
32   | 64 | 1280 | 32 |  1 KB
64   | 32 |  640 | 32 |  2 KB
128  | 16 |  320 | 32 |  4 KB
256  |  8 |  160 | 32 |  8 KB
512  |  4 |   80 | 32 | 16 KB
1024 |  2 |   40 | 32 | 32 KB
The delay from T1's first memory request is hidden by the initial running of T2–T4. After a small amount of additional idle time T1's data arrives in the L1 cache and T1 resumes execution using this data; T1 then stalls again with a second memory access, shown as the grey vertical line. However, now there is no further idle time, as the data will reach the L1 cache before it is needed.
On the Pascal architecture each SM has four warp-engines each capable of running up to
16 active warps, or equivalently 64 warps or 2048 threads on the SM. Thus, if the global
memory latency is 400 cycles, each thread would need to do only 25 cycles worth of
computation between memory accesses to fully hide this latency. This is a best-case situation
because the kernel launch configuration may restrict the maximum number of active warps
to less than 16. The occupancy of a kernel is defined as the ratio of the number of warps actually resident (active) on an SM to the maximum number of warps the SM can support:

occupancy = (active warps per SM) / (maximum warps per SM)
The factors which may limit occupancy are the thread block size, the number of thread
blocks, the number of registers used by each thread and the amount of shared memory used
by a thread block. The number of thread blocks should be an integer multiple of the number of SMs in the GPU, sufficient to give 2048 threads per SM. There are also limits of 64K 32-bit registers and 64 KB of shared memory per SM. These limits are illustrated in
Table 2.2 for a Pascal GPU with 20 SMs. Note, the number of registers allocated to each
thread is determined by the compiler and depends on the complexity of the code. For full
occupancy, the hardware has only enough registers for 32 registers per thread. The number of registers actually used by a kernel can be inspected with the NVCC compiler switch --resource-usage and capped with --maxrregcount.
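If you want a quick runtime estimate of these limits for a specific kernel, the CUDA runtime can report how many thread blocks of a given size will fit on one SM. The sketch below uses a trivial placeholder kernel and converts the result into an occupancy figure; it is an illustration rather than part of the example set.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy(float *x, int n)   // placeholder kernel for the query
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n) x[i] *= 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threads = 256, numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, dummy, threads, 0);

    int activeWarps = numBlocks*threads/32;
    int maxWarps    = prop.maxThreadsPerMultiProcessor/32;
    printf("resident blocks per SM %d, occupancy %.0f%%\n",
           numBlocks, 100.0*activeWarps/maxWarps);
    return 0;
}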
Full occupancy is more important for memory bound kernels than it is for compute bound
kernels but it is always a good idea to keep your kernel code compact and straightforward as
this will allow the compiler to allocate registers more effectively. It is also a good idea to split
long calculations into stages and use separate kernels for each stage. Remember that the contents of GPU global memory are preserved between kernel launches.
It is now time to move on to more interesting GPU code where threads have to actively
cooperate to perform a calculation.
if (flag == 0) function1(a1,a2,...);
else function2(b1,b2,...);
If all the 32 threads in a particular warp have flag=0, then all threads will call
function1 and there is very little performance loss, the same is true if none of the
32 threads have flag set to zero. However, if even just one of these threads has flag set to
a non-zero value while the other 31 threads have flag=0 then we get a so-called branch divergence. The system handles this by serializing the calls to the two functions; that is, the
subset of threads in the warp with flag=0 execute the call to function1 while the threads
having flag non-zero stay idle. Then, when the function has returned for all active threads,
the else clause calling function2 is executed by the previously idle threads while
previously active threads are now idle.7
If the functions concerned are modest and require only a small fraction of the kernel’s
execution time, no great harm is done, but otherwise there can be up to a factor two drop in
performance. If the called functions also have branch divergences the performance penalty is
even worse.
If you have not encountered parallel programming on GPUs before, the need to remove all
if statements from your code may seem like a deal breaker – but as we shall see in our
examples this can be achieved quite straightforwardly in many cases. I also have to confess
that I enjoy the intellectual challenge of designing good GPU code.
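As a small example of what "removing an if" can mean in practice, the two kernels sketched below (invented for illustration) compute the same element-wise clamp; the second expresses the choice as arithmetic, so a warp never needs to follow two different paths for the data-dependent test. Out-of-range guards like line 15.4 are unavoidable, but they are cheap because whole warps usually take the same branch.

__global__ void clamp_if(float *x, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n){
        if(x[i] < 0.0f) x[i] = 0.0f;      // threads within a warp may diverge here
    }
}

__global__ void clamp_branchless(float *x, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n) x[i] = fmaxf(x[i], 0.0f);   // same result, expressed without a data-dependent branch
}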
The GPU implementation of this algorithm is shown in Example 2.5. The host code which
initialises the data and manages most of the calculation is also shown and discussed in
this example.
01 #include "cx.h"
02 #include "cxtimers.h"
03 #include <random>
12 thrust::host_vector<float> x(N);
13 thrust::device_vector<float> dev_x(N);
19 cx::timer tim;
20 double host_sum = 0.0; // host reduce!
21 for(int k = 0; k<N; k++) host_sum += x[k];
22 double t1 = tim.lap_ms();
• Line 18: The contents of x are copied from the host to dev_x on the GPU. The details of the
transfer are handled by thrust.
• Lines 19–22: A timed loop to perform the reduction on the host using a simple for loop.
• Lines 24–31: Implement the GPU-based parallel iteration of Algorithm 1. For each pass through the
for loop the reduce0 kernel called in line 28 causes the top half of the array dev_x to be
“folded” down to an array of size m by adding the top m elements to the bottom m elements. The last
pass through the loop has m=1 and leaves the final sum in dev_x[0]; this value is copied back to
the host in line 35.
• Lines 28–29: Within the for loop the kernel launch parameters blocks and threads are set so that the total number of threads in the grid is exactly m. This code will fail if N is not a power of 2 due to rounding-down errors at one or more steps in the process. (A sketch of the reduce0 kernel and this host loop appears below.)
In CUDA programs a kernel launch such as that used in line 28 will not block the host which will
proceed to the next line of the host program without waiting for the kernel call to finish. In this case
that means all the kernel calls (23 in all for N=2^24) will be rapidly queued to run successively on the
GPU. In principle the host can do other CPU work while these kernels are running on the GPU. In this
case we just want to measure the duration of the reduction operation so before making the time
measurement we must use a cudaDeviceSynchronize call in line 30 which causes the host to
wait for all pending GPU operations to complete before continuing. This kind of synchronisation issue
often occurs in parallel code.
• Lines 32–33: Here we copy the final sum in dev_x[0] back to the host, again using thrust, and print the results.
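The reduce0 kernel itself is not reproduced in the fragment above; a minimal sketch of the folding pattern just described is shown below. The kernel body is exactly the x[tid] += x[tid+m] statement discussed in the text; the host-side bookkeeping is illustrative and assumes N is a power of 2.

#include <cuda_runtime.h>

__global__ void reduce0(float *x, int m)
{
    int tid = blockDim.x*blockIdx.x + threadIdx.x;   // exactly m threads are launched
    x[tid] += x[tid + m];                            // fold top half onto bottom half
}

// host side: halve the active array size on each pass, final sum ends up in x[0]
void gpu_reduce(float *dev_x, int N)                 // N must be a power of 2
{
    for(int m = N/2; m > 0; m /= 2){
        int threads = (m >= 256) ? 256 : m;
        int blocks  = m/threads;                     // threads*blocks == m exactly
        reduce0<<<blocks, threads>>>(dev_x, m);
    }
    cudaDeviceSynchronize();                         // wait before timing or copying x[0]
}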
The bottom line shows the results obtained running this program with the default value of 2^24 for the
number of values to be summed. Note the kernel execution time of 0.535 ms is too short a single
measurement to be reliable. The values shown in these reduce examples were in fact obtained as
averages of 10,000 runs using a for loop around kernel calls. An alternative method would be to use
the Nsight Compute profiling tool, but our simple host-based method using cx::timer is a good
starting point.
An interesting feature of the results obtained is that the host calculation uses a 64-bit
double variable to accumulate the sum of the x values but the GPU does not. However, the
results differ by only 1.1 parts in 10^-8 – this is about the best that can be expected from a 32-bit floating point calculation; rounding errors have not accumulated in the GPU calculation. On the other hand, if we change the variable host_sum (line 20 of the host code in Example 2.5) to a float instead of a double, the accuracy of the host calculation falls to only about 3 parts in 10^-5; thus rounding errors do accumulate in the host calculation. This
difference is due to the fact that the GPU accumulates many intermediate partial sums and
thus tends to be always adding numbers of similar sizes. Although this improvement is data
dependent, this is encouraging to see, as we plan to use 32-bit floats in our GPU calculations
whenever possible.
While accurate, our kernel is very inefficient and unlike the compute bound problem in
Chapter 1, reduction is a memory bound problem and the reduce0 kernel does not handle
this well.
Firstly, the only calculation done by each thread is a single addition, and secondly the
statement:
x[tid] += x[tid+m],
triggers three global memory operations, namely loading both the values stored in x[tid]
and x[tid+m] into GPU registers and then storing the sum of these values back into
x[tid]. If we could accumulate partial sums in local registers, that would reduce the
number of global memory accesses needed for each addition down to one, which offers a
speed-up by a potential factor of three.
Secondly, the host calls the kernel iteratively, halving the array size at each step to
complete the reduction process, leaving the final sum in the first array element. The effect
of this is to double the number of times the x[tid] += x[tid+m] statement is
performed. If we could instead perform the iteration inside the kernel that could also reduce
the number of memory accesses required.
Finally, the kernel of Example 2.5 is not general enough: the array size must be a power of 2 and the host has to make multiple calls to the kernel using a carefully crafted sequence of launch parameters. A better solution is to use thread-linear addressing, with user-defined values of blocks and threads, to get something like Example 2.6:
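The Example 2.6 listing is not reproduced here; a sketch of the kind of kernel the text describes, using thread-linear addressing and a register accumulator, is shown below (the book's actual reduce1 may differ in detail).

__global__ void reduce1(float *x, int N)
{
    int tid    = blockDim.x*blockIdx.x + threadIdx.x;
    int stride = gridDim.x*blockDim.x;

    float tsum = 0.0f;                          // accumulate the partial sum in a register
    for(int k = tid; k < N; k += stride) tsum += x[k];
    x[tid] = tsum;                              // one partial sum per thread, written once
}

The host would call this kernel with the user-chosen <<<blocks,threads>>> to leave blocks × threads partial sums at the start of x, then again with a smaller configuration to combine those partial sums, and finally (as the text notes for line 27 of Example 2.6) with a single thread to produce the final sum in x[0].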
The reduce1 kernel is about twice as fast as reduce0, which is not a bad start, but we can do more. Our reduce1 kernel is also much more user friendly: it can cope with any value of the input array size N and the user is free to tune the launch configuration parameters blocks and threads.
Notice that in the last reduce step, line 27 of Example 2.6, we used a single thread running
alone to sum threads values stored in x. We can do better than this by getting the threads
in each thread block to cooperate with each other.
A key feature of NVIDIA GPUs is shared memory which allows the threads within a thread
block to cooperate efficiently. A thread block running on the GPU can reserve an allocation of
shared memory, and all threads in the thread block can then read from and write to that
memory. Threads in different thread blocks of a kernel get different allocations of shared-
memory and cannot read or write to each other’s allocations. Each SM unit has a pool of
shared memory which is divided between all the currently resident thread blocks that request
shared memory. Of course, device global memory is also visible to all threads on all thread
blocks but accessing shared memory is much faster and can be as fast as using registers.
NVIDIA GPUs have at least 48 KB of shared memory per SM unit and more recently 64 KB or more. The precise amount depends on the CC level and is summarised in Table 2.3. It is clear from the table that at the thread level shared memory is a scarce resource; there is enough for only 8 or 16 4-byte words per thread. If more is required one can try using more shared memory per thread at the expense of lower occupancy: the runtime system will automatically run fewer thread blocks per SM if necessary. If a kernel requests more shared memory than the
total available on an SM then the kernel launch will fail. Shared memory featured very
prominently in early CUDA tutorials and books because these GPUs had little or no caching
capability for global memory accesses. More recent GPUs have good L1 and L2 caching
and in particular L1 caching can sometimes work well as an alternative to using shared
memory. Interestingly devices with CC 7.0 and above have a single fast memory resource
that can be shared between L1 and shared memory partitions in different proportions for
different kernels.
In our next Example 2.7 we use shared memory to enable the threads in each thread
block to sum their individual accumulated totals and then write a single word with the block-
sum to external memory. The scheme of Figure 2.2 is again used for this intra-
block reduction.
04 int id = threadIdx.x;
05 int tid = blockDim.x*blockIdx.x+threadIdx.x;
06 int stride = gridDim.x*blockDim.x;
07 tsum[id] = 0.0f;
08 for(int k=tid;k<N;k+=stride) tsum[id] += x[k];
09 __syncthreads();
// power of 2 reduction loop
10 for(int k=blockDim.x/2; k>0; k /= 2){
11 if(id<k) tsum[id] += tsum[id+k];
12 __syncthreads();
13 }
// store one value per thread block
14 if(id==0) y[blockIdx.x] = tsum[0];
15 }
29 cx::timer tim;
30 double host_sum = 0.0; // host reduce!
31 for(int k = 0; k<N; k++) host_sum += x[k];
32 double t1 = tim.lap_ms();
37 cudaDeviceSynchronize();
38 double t2 = tim.lap_ms();
39 double gpu_sum = dx[0]; // D2H copy (1 word)
40 printf("sum of %d numbers: host %.1f %.3f ms
41 GPU %.1f %.3f ms\n",N,host_sum,t1,gpu_sum,t2);
42 return 0;
43 }
• Lines 10–13: This is the power-of-2 reduction scheme of Figure 2.2, implemented here to sum the values in tsum within a thread block. This section of code assumes that blockDim.x is a power of 2. Note that the number of active threads reduces by a factor of 2 on each pass through the for loop.
Older tutorials tend to dwell on further optimisation of this loop by explicitly unrolling and exploiting
synchronicity within 32-thread warps. This will be discussed in the next chapter on cooperative groups.
For now, note further optimisation of this loop is only important for smaller datasets.
• Line 14: The final block sum accumulated in tsum[0] is stored in the output array y using
blockIdx.x as an index.
• Lines 16–45: This is the main routine; much of it is similar to the previous example and here we will
just mention differences.
• Lines 18–20: Here we give the user the option to set the array size N and the launch parameters
blocks and threads. Note blocks needs to be a power of 2 for the reduce2 kernel to
work properly.
• Line 23: We now allocate a device array dy having dimension blocks. This new array will hold
the individual block wide reduction sums.
• Line 35: Here we call the reduce2 kernel for the first time to process the whole dx array with the
block sums being stored in the output array dy. Note the third kernel launch parameter, which requests a shared memory allocation of threads 4-byte floats for each active thread block. A large value here may result in reduced occupancy.
• Line 36: Here we call reduce2 again but with the array arguments swapped round. This causes the values stored in y by the previous kernel call to themselves be summed, with the total placed in x[0]. This requires a launch configuration of a single thread block containing blocks threads.
The result at the end of the listing shows that reduce2 is about 2.65 times faster than reduce0.
A worthwhile optimisation of the reduce2 kernel would be to drop the restriction that blocks
must be a power of 2. This is because in many GPUs the number of SM units is not a power of 2.
For example, my GPU has 36 SMs, so to keep all SMs equally busy it is better to use 288 rather than 256 for the user-set value of blocks. We can do this by replacing blockDim.x in
line 10 of the reduce2 kernel by the smallest power of 2 greater than or equal to blocks. For
blocks = 288 this would be 512. The effect of doing this is that in the first pass when k=256,
threads with rank 0 to 31 will add values from tsum[256] to tsum[287] to their tsum
values. We also have to add an out-of-range check to prevent threads 32-255 from attempting
out-of-range additions. The modified reduce3 kernel is shown in Example 2.8.
04 int id = threadIdx.x;
05 int tid = blockDim.x*blockIdx.x+threadIdx.x;
06 int stride = gridDim.x*blockDim.x;
07 tsum[id] = 0.0f;
08 for(int k=tid;k<N;k+=stride) tsum[id] += x[k];
09 __syncthreads();
• Line 10.1: Here we add a new variable block2 which is set to the value of blockDim.x rounded up to the lowest power of 2 greater than or equal to blockDim.x. We use the cx utility function pow2ceil for this. That function is implemented using the NVIDIA intrinsic function __clz(int n), which returns the number of leading zero bits in n, and is a device-only function. (A possible implementation is sketched after this example's results.)
• Line 10.2: This is the same as line 10 in reduce2 except we use the rounded up block2/2 as the
starting value of k.
• Line 11: This corresponds to line 11 of reduce2 with an added out-of-range check on id+k.
In the last line we see that launching this kernel with exactly 8 thread blocks per SM gives a speed-up
of 2.73 compared to reduce0, slightly better than reduce2.
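One possible implementation of such a pow2ceil function, built on __clz as described above, is sketched below; the actual cx.h code may differ in detail.

// round n up to the next power of 2; device-only because __clz is a GPU intrinsic
__device__ int pow2ceil(int n)
{
    return (n <= 1) ? 1 : 1 << (32 - __clz(n - 1));   // __clz counts leading zero bits
}

// used in reduce3 as, for example:  int block2 = pow2ceil(blockDim.x);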
The reduce3 kernel is about 70 times faster than the single-core host version. While this is not quite as spectacular as our Chapter 1 result for a compute bound calculation, reduction is a memory bandwidth bound calculation with just one add per 4-byte read from memory, so we
expect reduced performance. Given that the GPU memory bandwidth is only about 10 times
that of the CPU the factor 70 improvement shows that other GPU features including the
latency hiding are helping speed up this memory bound problem. The last trick to try is
explicitly unrolling the loop in lines 10–13.
04 int id = threadIdx.x;
Note there is a __syncthreads after each step in lines 10–13. These calls are necessary to ensure that
all threads in the thread block have completed their addition before any of them proceed to the next step.
• Lines 15–19: These lines are the final five steps in the parallel reduction tree. In these lines only
the first 32 threads participate. These threads are all in the same warp so we can replace
__syncthreads with the much faster __syncwarp. For devices of CC < 7 all threads in the
same warp act in strict lockstep so here it is possible to rely on implicit warp synchronisation and
omit the __syncwarp calls entirely. You will find this done in early (now deprecated) tutorials.
Even if you only have access to older devices, we strongly recommend that you always use __syncwarp where it would be necessary on newer devices, to maintain code portability.
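A sketch of one safe way to write such an explicitly unrolled kernel is shown below; it assumes a thread block size of 256 and may differ in detail from the book's reduce4. The final five steps only involve threads of the first warp, so __syncwarp is sufficient there, and each step reads and writes disjoint parts of tsum so there is no intra-step race.

__global__ void reduce4_sketch(float *y, const float *x, int N)
{
    extern __shared__ float tsum[];                  // one float per thread (set at launch)
    int id     = threadIdx.x;
    int tid    = blockDim.x*blockIdx.x + threadIdx.x;
    int stride = gridDim.x*blockDim.x;

    tsum[id] = 0.0f;
    for(int k = tid; k < N; k += stride) tsum[id] += x[k];
    __syncthreads();

    // unrolled block-wide steps, assuming blockDim.x == 256
    if(id < 128) tsum[id] += tsum[id+128]; __syncthreads();
    if(id <  64) tsum[id] += tsum[id+ 64]; __syncthreads();
    if(id <  32) tsum[id] += tsum[id+ 32]; __syncthreads();

    // final five steps: all active threads are in the first warp
    if(id < 16) tsum[id] += tsum[id+16]; __syncwarp();
    if(id <  8) tsum[id] += tsum[id+ 8]; __syncwarp();
    if(id <  4) tsum[id] += tsum[id+ 4]; __syncwarp();
    if(id <  2) tsum[id] += tsum[id+ 2]; __syncwarp();
    if(id <  1) tsum[id] += tsum[id+ 1]; __syncwarp();

    if(id == 0) y[blockIdx.x] = tsum[0];             // one block sum per thread block
}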
The result in the last line shows at best a tiny improvement compared to reduce3.
The performance difference between reduce3 and reduce4 is small but reduce4 has
introduced us to warp level programming. We will return to the reduce problem in the next
chapter and show how warp-based programming can be taken much further.
Next we will discuss shared memory in more detail and then explore another application,
namely matrix multiplication.
This is arguably the simplest method but lacks flexibility. Shared memory array sizes
typically depend on the number of threads in the thread block; thus if they are fixed at compile time then so is the size of the thread block.
Dynamic shared memory allocation is an alternative where the kernel does not specify the
size of an array but declares an externally allocated shared pointer, for example:
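Such a declaration looks like the following (tsum here matches the name used in the reduce kernels; the element type is whatever the kernel needs):

extern __shared__ float tsum[];   // size in bytes supplied as the third launch parameter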
In this case the actual memory size required is specified by the host at kernel launch time, using the value (in bytes) as the third kernel launch parameter. Since this value can be a variable determined during program execution, this method of memory allocation is known as dynamic. Examples 2.4 and 2.5 use this method.
Static memory declarations are usually placed at the start of your kernel code, but this is
not mandatory; obviously like all variables, their declaration needs to precede their use.9
More than one shared array or variable can be declared in a single kernel; the static allocation case is straightforward, for example:
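(For illustration, a pair of static declarations like the following, with invented names and sizes chosen to match the byte counts quoted next:)

__shared__ float sa[256];   // 256 x 4 bytes = 1024 bytes
__shared__ float sb[128];   // 128 x 4 bytes =  512 bytes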
will work as expected creating separate arrays with a total memory requirement of
1024+512 bytes. However, the corresponding dynamic allocation in kernel code:
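(again for illustration, the analogous pair of extern declarations)

extern __shared__ float sa[];   // both declarations refer to the SAME starting
extern __shared__ float sb[];   // address of the dynamically allocated block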
will compile successfully without warnings but both arrays will start at the same address –
namely the starting address of the reserved block of shared memory that is allocated when
the kernel runs. Although annoying and bug prone, this is the only possible thing the
compiler can do since it does not know the array sizes at compile time. Thus, during
execution, writing to either array will write to both, leading to kernel failure or (worse) hard
to find bugs. In order to fix this problem, we need to modify the kernel code so that only one
array is declared extern and all other arrays are declared as pointers with appropriately
calculated offsets from the extern array as shown in Example 2.10.
09 int id = threadIdx.x;
10 sx[id] = 3.1459*x[id];
11 su[id] = id*id;
12 sc[id] = id%128;
// do useful work here
. . .
30 int threads = (argc >1) ? atoi(argv[1]) : 256;
31 int blocks = (size+threads-1)/threads;
32 int shared = threads*(sizeof(float) +
sizeof(ushort) + sizeof(char));
33 shared_example<<< blocks, threads, shared >>>
(dx_ptr,dy_ptr,size);
• In line 6: A single dynamically allocated shared memory array sx of type float is declared. Note
that sx is just a C style pointer to an array of floats. We could have used “float *sx;” instead of
“float sx[]”
• Lines 7–8: Here pointers to two additional arrays, su and sc, are declared using pointer arithmetic
to calculate their offsets from the start of sx. In line 7 the su pointer is set to the address after
blockDim.x floating point elements of the array sx and then cast to the ushort pointer type.
Similarly, in line 8 the sc pointer is set to the address after blockDim.x ushort elements of the array su and then cast to the char pointer type.
• Lines 9–12: Here we demonstrate use of the arrays; the variable id is set to the current thread's rank in the thread block and then used normally to index the three arrays.
• Lines 30–33: These show a fragment of the corresponding host code containing the kernel launch.
○ Line 30: The launch parameter threads is set using an optional user-supplied value.
○ Line 31: The parameter blocks is then set as usual.
○ Line 32: A third launch parameter shared is set here; its value is calculated as the total number of bytes necessary for the three arrays.
○ Line 33: This shows the kernel launch using three parameters in the launch configuration.
One subtle detail of this example is that the calculation made in line 32 makes no
allowance for memory “gaps” between the arrays that might be needed for natural alignment
of each array on memory boundaries. However, because the declarations and assignments in
lines 5–8 of the kernel go from the longest variable type (4-byte floats) to the shortest
variable type (1-byte chars), natural alignment will be achieved for all three arrays without
the compiler needing to introduce gaps.10
Simple variables can also appear in dynamically allocated shared memory, but since
their size, namely sizeof(variable type), is known at compile time, static
allocation is the best choice. If the variable is intended to contain some parameter which
is read but not changed by the threads, then using constant memory might be a
better choice. Note that constant memory will be automatically used for most kernel
arguments.
As mentioned above, 2D arrays can be implemented in numerous ways, but the most efficient method is to use a single contiguous memory block and address elements using the 2D version of the rank formula given above, namely:
index = i*Ncols + j
In this scheme the column index j is the "hot" index: j and j+1 refer to adjacent memory locations, whereas i and i+1 refer to memory locations separated by a stride of Ncols. In most
of our examples we will use thrust as a container class for both host and device arrays.
We are fully aware that C++ provides nice tools to create beautiful containers for vectors
and matrices with support for more elegant addressing schemes such as the Fortran like
A(i,j) or the C multidimensional A[i][j] style. Indeed we had intended to adopt one of
these wrappers when we began this project. However at the time of writing any attempt to
pass any object other than a bare pointer to a CUDA kernel prevents any optimisations based
on using __restrict and since our goal is fast code we will stick with the simple explicit
address calculation as shown in the box.11 Example 2.11 shows a straightforward imple-
mentation of matrix multiplication on the host.
01 #include "thrust/host_vector.h"
02 #include "cxtimers.h"
03 #include <random>
21 thrust::host_vector<float> A(Arow*Acol);
22 thrust::host_vector<float> B(Brow*Bcol);
23 thrust::host_vector<float> C(Crow*Ccol);
30 hostmult0(C.data(),A.data(),B.data(),Arow,Acol,Bcol);
31 double t1 = tim.lap_ms();
32 double flops = 2.0*(double)Arow*(double)Acol*
(double)Bcol;
33 double gflops= flops/(t1*1000000.0);
34 double gbytes = gflops*6.0; // 12 bytes per term
35 printf("A %d x %d B %d x %d host time %.3f ms
Gflops/sec %.3f\n",Arow,Acol,Brow,Bcol,t1,gflops);
36 return 0;
37 }
D:\ >hostmult0.exe
A 1024 x 1024 B 1024 x 1024 host time 2121.046 ms
GFlops 1.013 GBytes 6.076
sizes of all three matrices. Note we use y and x instead of row and col to denote the first and second dimensions of the matrices. Thus, A is ay × ax, B is ax × bx and C is ay × bx; we infer the first dimension of B and both dimensions of C from the properties of matrix multiplication.
• Lines 7–8: These for loops over i and j cover all the elements of the desired product C (a sketch of the full function body appears after this list).
• Line 9: The inner loop over k implements the summation from the standard formula. You can think
of this summation as a dot product between the ith row of A and the jth column of B. Notice how the
array indices vary with the for loop index k. The factor A[i*Ax+k] behaves "nicely" because as k increments, it addresses elements of A which are adjacent in memory; this is optimal for caching.
On the other hand, the factor B[k*Bx+j] addresses memory with a stride of Bx words between
successive values of k, which gives poor cache performance. This problem is inherent in matrix
multiplication and has no simple fix.
Notice also that a triple for loop is needed for matrix multiplication. If the matrices have dimensions of 10^3 then a total of 2 × 10^9 arithmetic operations are required – multiplication of big matrices is slow!
You might worry that expressions like i*Bx+j used for the array indices add a significant
computational load for each step through the loop. In fact, this sort of index expression is so
common that compilers are very good at generating the best possible code for indexing such
arrays efficiently.
The user can optionally set the dimensions of A (Arow and Acol) and the number of columns of B (Bcol). The dimensions of C and the number of rows of B are set to be compatible with matrix multiplication.
○ Lines 21–23: Here we allocate thrust vectors to hold the matrices.
○ Lines 24–28: Here A and B are initialised with random numbers.
○ Lines 32–35: Print some results; the performance in GFlops/sec assumes two operations per term (a multiply and an add).
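The body of hostmult0 is not reproduced in the fragment above; a sketch consistent with the description of lines 7–9 would be the following (argument names follow the text):

// C = A*B where A is Ay x Ax, B is Ax x Bx and C is Ay x Bx, all stored row-major
int hostmult0(float *C, float *A, float *B, int Ay, int Ax, int Bx)
{
    for(int i = 0; i < Ay; i++)
        for(int j = 0; j < Bx; j++){
            C[i*Bx+j] = 0.0f;
            for(int k = 0; k < Ax; k++)
                C[i*Bx+j] += A[i*Ax+k]*B[k*Bx+j];   // dot product of row i of A and column j of B
        }
    return 0;
}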
The performance of this code is quite poor but we can improve it significantly by adding the __restrict keyword (a compiler extension discussed below) to the pointer argument declarations in line 4. This is shown in Example 2.12 where only line 4 from Example 2.11 has been changed.
. . .
04 int hostmult1(float * __restrict C, float * __restrict A,
float * __restrict B, int Ay, int Ax, int Bx)
. . .
D:\ > hostmult1.exe
A 1024 x 1024 B 1024 x 1024 host time 1468.845 ms
GFlops 1.462 GBytes 8.772
We have improved the performance by 44% with this simple change! If you are not
familiar with the history of C/C++ this requires some explanation. When a function is
declared with a simple pointer argument, the compiler has no way of being certain that there
are no other pointers to the same memory location elsewhere in the code. This is called
pointer aliasing and in the early days of C when computer memory was a scarce resource,
people would deliberately use pointer aliasing to use the same piece of memory for different
purposes at different stages in the program. Needless to say, this practice often resulted in
hard-to-find bugs. On modern systems with 64-bit memory addressing pointer aliasing is
completely unnecessary yet the memory of old practice lingers on in modern compilers
which are still reluctant to fully optimise code involving simple pointers. Specifically, they
will tend to unnecessarily store intermediate results back to main memory rather than using
registers. Adding the restrict qualifier to a pointer declaration tells the compiler that the
pointer is not aliased and aggressive optimisation is safe.12
The CUDA NVCC compiler also supports restrict and the performance of many
kernels does indeed improve when it is used. Thus, we come to the conclusion shown in
the box:
As mentioned above, in practice C++ compiler support for restrict is quite shallow; if the
restrict pointer is passed as a function argument wrapped in even a simple C++ class then
restrict has no effect. In fact, while restrict is officially part of modern C11, it is not part of
any C++ standard up to C++17. Fortunately, most recent C++ compilers including Visual
Studio and g++ do support restrict, albeit in a shallow form. Another issue is that while the C standard uses restrict without decoration, C++ compilers use decorated forms such as __restrict or __restrict__ instead.
At this point we should mention another qualifier, const; many books on C++ get very
excited about this. We discuss it in more detail in our C++ coding appendix. In principle use
of const can allow the compiler to further optimise code. In practice we find using const
does not usually give much or any performance gain; its use is actually more important as a
safeguard to prevent accidental overwriting of variables and to make a programmer’s
intentions clear. In the case of pointers there are four possibilities shown in Table 2.4.
Table 2.4 Possible combinations of const and restrict for pointer arguments. The table lists each combination of bare, const and __restrict pointer declarations alongside the templated cx.h wrapper used as an abbreviation. Example of their use:
int hostmult1(r_Ptr<float> C, cr_Ptr<float> A, cr_Ptr<float> B, int Ay, int Ax, int Bx)
The middle column of the table shows templated wrappers defined in cx.h that can be
used to hide the gory details in the first column. An example of their use is shown in the
bottom row. These wrappers can be used in both host and kernel code and the first two will
be used in most of our examples from now on.
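The cx.h definitions are along the following lines (a plausible sketch; the actual header may differ in detail):

// restricted pointer:        r_Ptr<float>  stands for float * __restrict
// const restricted pointer:  cr_Ptr<float> stands for const float * __restrict
template <typename T> using r_Ptr  = T * __restrict;
template <typename T> using cr_Ptr = const T * __restrict;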
It is now time to look at a GPU version of matrix multiply; to get the best performance is
actually quite complicated but we will start with a simple approach as shown in Example 2.13.
21 thrust::host_vector<float> A(Arow*Acol);
22 thrust::host_vector<float> B(Brow*Bcol);
23 thrust::host_vector<float> C(Crow*Ccol);
23.1 thrust::device_vector<float> dev_C(Crow*Ccol);
23.2 thrust::device_vector<float> dev_A(Arow*Acol);
23.3 thrust::device_vector<float> dev_B(Brow*Bcol);
• Lines 4–11: The GPU kernel gpumult0 replaces the previous hostmult0 function here. The
kernel is designed to use one thread to calculate one element of the matrix product. The kernel
expects to be called with a 2D grid of thread blocks with sufficient threads in the x and y dimensions
to span all the elements of C. As before x is the column index and y is the row index.
• Lines 6–7: Here we set tx and ty from the built-in variables to determine which element of C this
thread will calculate. These lines effectively replace the loops over i and j used in the host version; we can think of the kernel as calculating all the elements of C in parallel.
• Line 8: This is an out-of-range check on tx and ty. It is necessary because the dimensions of each
thread block may have been rounded up.
• Lines 9–10: Here we calculate one element of C using the standard formula. Notice the factor
B[k*Bx+tx] in line 10 still uses a memory stride of Bx words on successive passes through the
for loop over k. But now in this parallel kernel adjacent threads will use adjacent elements of
B because adjacent threads have adjacent values of tx. Thus L1 caching will be efficient for both
factors in the multiplication – this is an interesting example of how parallel CUDA code can provide
efficient memory access in situations where single threaded code struggles.
• Lines 20.1–20.2: We add two additional user settable parameters tilex and tiley which define
the x and y dimensions of the thread blocks used by the kernel launch. These are equivalent to the
threads and blocks parameters we use in many 1D examples.
• Lines 23.1–23.3: Here we allocate device arrays to hold copies of the matrices A, B and C.
• Lines 28.1–28.2: Copy A and B to the device.
• Line 28.3: Set threads to a dim3 triad representing a 2D tile on the matrix C.
• Line 28.4: Set blocks as a dim3 with x and y dimensions sufficient for the thread block tiles in
threads to span the matrix C. Notice the integer rounding up for cases where the dimensions of C are
not exact multiples of tilex and tiley. The out-of-range test in line 8 is necessary for cases
where rounding up was needed. Rounding up and consequent testing in kernels are very common in
CUDA code written to process general cases where not everything is a power of 2.
• Lines 29–31: This timed loop is similar to that of Example 2.9 but performs a kernel launch instead
of a host function call. The use of cudaDeviceSynchronize is necessary for timing purposes.
• Line 31.1: Here we copy the result back to the host. Although C is not used in the code shown here,
it would obviously be used in real-world code. Indeed, we have used C to compare the results from the host and GPU versions and find the calculated Cij values agree to about 6 significant figures.
The timing result in the last line shows that there is an impressive speed-up of about 220 times
compared to the host calculation in Example 2.12.
If we change the gpumult0 declaration in line 4 to use restrict we get the gpumult1
kernel declaration shown in Example 2.14. This example shows two alternative methods of
declaring restrict arrays in kernel code; the first method just uses C++ keywords and is quite
verbose; the second method uses cx defined abbreviations for the same result. Note that
cr_Ptr and r_Ptr are defined with templated using statements in cx.h. We will use
the abbreviated versions in all our later examples.
. . .
04 __global__ void gpumult1(float * __restrict C,
const float * __restrict A, const float * __restrict B,
int Ay, int Ax, int Bx)
or:
04 __global__ void gpumult1(r_Ptr<float> C, cr_Ptr<float> A,
cr_Ptr<float> B, int Ay, int Ax, int Bx)
We can see that simply using restrict on our GPU matrix multiply code gives a dramatic
speed-up of more than a factor of 2.6 (compared to about 1.4 for host code). The effective
memory bandwidth is also much greater than the hardware limit of about 400 GBytes/sec
separately for read and write, demonstrating that memory caching is playing an important role.
Finally, if you really hate the explicit address calculations in line 10 of Example 2.13 they
can be hidden using a lambda function as shown in Example 2.15.
Example 2.15 gpumult2 kernel using lambda function for 2D array indexing
// lambda function
09 auto idx = [&Bx](int i,int j){ return i*Bx+j; };
10 C[idx(i,j)] = 0.0;
11 for(int k=0;k<Ax;k++)
C[idx(i,j)]+= A[idx(i,k)]*B[idx(k,j)];
12 }
In Example 2.15 we have added a new line 9 which defines a local function idx that performs the standard 2D address calculation needed for this kernel. The span needed to step between successive rows of B and successive rows of C is Bx, the number of columns of B and (necessarily) also the number of columns of C. The value is captured by the lambda function using the [&Bx] syntax to indicate that the variable Bx used in the body of the lambda function is the same variable as used in the main body of the surrounding function.
Moreover, by prefixing & to Bx we indicate that the variable is to be used by reference with
no copy required. This should lead to the compiler generating code identical to Example
2.14, and indeed we find no performance difference between these two versions. Using a
lambda function in this way is modernising the old trick of using a macro; in this case:
#define idx(i,j) ((i)*Bx+(j))
which achieves the same effect. However, we deeply deprecate the use of macros in this way because even if the macro occurs inside a function its definition will persist throughout the code, which risks hard-to-find bugs and greatly complicates code where different 2D spans are needed in different parts of the code. Note also the precautionary brackets around i and j in the macro definition; these are needed for correctness if, say, i is passed as i+1.
To exploit shared memory, we can use each thread in a thread block to store different
elements of A and B in shared memory and then let all the threads in the block use all the
cached values to calculate contributions to the product. This is relatively straightforward
because matrix multiplication can be represented as a sum over products of 2D tiles defined
over the matrices as shown in the following equation:

T^C_{I,J} = \sum_{t=1}^{M_T} T^A_{I,t} \, T^B_{t,J}

where the Ts are rectangular tiles defined over the matrices and the product T^A_{I,t} T^B_{t,J} represents matrix multiplication between tiles. The summation index, t, runs over the M_T contributing pairs of tiles; the matrix product implies further summations over the individual matrix elements within the tiles. The idea is shown in Figure 2.3 where we use a set of 3 × 3 tiles to perform the multiplication of 9 × 9 matrices.
Tiled matrix multiplication can be readily implemented in CUDA kernels by using 16 × 16 or 32 × 32 thread blocks to represent a pair of tiles from A and B. Each thread then first copies one element of its thread block's allocated A and B tiles into shared memory arrays. Once this process is complete, the same threads can then compute the elements of the tiled matrix multiplication to obtain that tile-pair's contribution to a tile in C. In this way each element of A and B is only read once from external memory instead of 16 or
32 times. Our implementation is the gputiled kernel shown in Example 2.16. In Figure 2.3
the element c45 is shown calculated conventionally in the top row and by tiled matrix
multiplication in the bottom row.
Example 2.16 gputiled kernel: tiled matrix multiplication using shared memory
01 #include "cx.h"
02 #include "cxtimers.h"
03 #include <random>
. . .
61 }
62 C[ay*Bx+bx] = csum; // store complete result
63 }
• Line 28.3: We use tilex to set both dimensions of the 2D thread blocks used to represent tiles.
While it is possible to use non-square tiles, that would complicate the kernel code.
• Lines 29–31: As before this is the timed block that launches a kernel and waits for completion. The kernel launch itself is now changed because the gputiled kernel is written to use the value of tilex as a template parameter. Here we use a 3-way if-else tree to allow values of 32, 16 or 8 for this parameter. The kernel argument list is the same as before.
• Line 40: This is the start of our new gputiled kernel; the arguments are as before and we are now
using the restrict keyword by default for all pointers. Note that this is a templated kernel; thus
the tile size parameter TS is known at compile time.
• Lines 42–43: We declare two statically allocated shared memory arrays to hold square tiles copied from A and B to Atile and Btile.
• Lines 44–45: Here we set the position of the current thread in the local TS × TS tiles. This depends
only on the thread block dimensions.
• Lines 46–47: Here we set ocx and ocy to the origin of the target tile in C using grid-block
quantities. These values are the same for all threads in the thread block.
• Lines 48–51: In the first two lines we set ax and ay to the current thread’s position in A based on the
first tile to be used. Similarly, in the second pair of lines we set bx and by for matrix B. Notice that
as we step to different tiles along the rows of A and down the columns of B ay and bx are constant
whereas ax and by change. In fact ay and bx are the i and j values of the cij element being
evaluated by the current thread.
• Line 51: The local variable csum is used to accumulate the current thread’s cij value; here we set it
to zero.
• Lines 53–61: Each pass through this loop performs matrix multiplication on one pair of tiles from A
and B and accumulates the result in csum.
○ Lines 54–55: Here we copy the current tiles from A and B to shared memory. Each thread copies
one element from A and one from B to Atile and Btile and will later read TS values back
from these arrays.
○ Line 56: An essential syncthreads here; no thread in the block can safely proceed until all the
elements of both tiles have been copied to shared memory and can be used to compute contributions to
the product.
○ Line 58: A second essential syncthreads; no thread can proceed to the next pass through the
loop until all threads have finished using the current pair of tiles.
○ Lines 59–60: Here we increment ax and by to point to the required position in the next tiles from
A and B.
• Line 62: Here we store the final result in C.
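The body of the gputiled kernel (lines 40–60) is not reproduced in this extract. The sketch below is our reconstruction from the line-by-line notes above; the argument names Ay, Ax and Bx and the exact line layout are assumptions, and it assumes the square matrices used in the text so that gridDim.x also spans the tiles along the summed direction:

    template <int TS> __global__ void gputiled(float * __restrict__ C,
            const float * __restrict__ A, const float * __restrict__ B,
            int Ay, int Ax, int Bx)
    {
        __shared__ float Atile[TS][TS];            // tiles copied from A and B (lines 42-43)
        __shared__ float Btile[TS][TS];
        int tx = threadIdx.x, ty = threadIdx.y;    // thread's position in its tile (44-45)
        int ocx = blockDim.x*blockIdx.x;           // origin of target tile in C (46-47)
        int ocy = blockDim.y*blockIdx.y;
        int ax = tx,       ay = ocy + ty;          // thread's position in A (48-49)
        int bx = ocx + tx, by = ty;                // thread's position in B (50)
        float csum = 0.0f;                         // accumulator for element c(ay,bx) (51)
        for (int t = 0; t < gridDim.x; t++) {      // loop over pairs of tiles (53)
            Atile[ty][tx] = A[ay*Ax + ax];         // each thread copies one element (54-55)
            Btile[ty][tx] = B[by*Bx + bx];
            __syncthreads();                       // wait for all copies (56)
            for (int k = 0; k < TS; k++) csum += Atile[ty][k]*Btile[k][tx];  // tile product (57)
            __syncthreads();                       // wait before tiles are overwritten (58)
            ax += TS;  by += TS;                   // step to the next tiles (59-60)
        }
        C[ay*Bx + bx] = csum;                      // store complete result (62)
    }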
The result in the last line shows that gputiled delivers more than 1 TFlop/sec of processing. A tile
size of 32 × 32 works best on the RTX 2070 GPU used for this test.
We note that using shared memory as shown in Example 2.16 gives a significant
performance boost of about 250 GFlops/sec amounting to about 1.1 TFlops/sec overall.
Although not shown here, we did try running this example without using restrict in
kernel arguments and found only a small drop in performance. This is presumably because
we now read from A and B fewer times and hence the performance gain from using
restrict on the pointers to these arguments is less important.
There is one last trick we can try to squeeze a bit more performance from our code and that
is explicit loop unrolling:
. . .
52 float csum = 0.0f;
52.1 #pragma unroll 16
53 for(int t=0;t<gridDim.x;t++){ // step A tiles along rows of A
. . .
As shown in the last line of Example 2.17 we have gained another 100 GFlops/sec of
performance by using loop unrolling. The optimal depth of unrolling can only be found by
experiment; on our RTX 2070 the value 16 seems to give the best result. On other GPUs you
may find a different optimum. Tuning GPU code always involves some experimentation.
Note that the NVCC compiler will often perform loop unrolling automatically, especially in
cases where the number of passes is known at compile time. For this reason, making the loop
count a template parameter can be worthwhile. Here this is done for the inner loop over TS
but not for the outer loop over gridDim.x, which is therefore not known at compile time.
Interestingly, we find that explicit unrolling of the outer loop helps, but in experiments (not
shown) explicit unrolling of the inner loop does not.
2.10 BLAS
Matrix multiplication is a classic problem in computational linear algebra and the results of
more than 50 years of development are encapsulated in the BLAS (basic linear algebra
subprograms) function libraries that are available for all serious computing platforms. BLAS is
used by calling appropriate functions to perform the desired operations. Matrix multiplication
of 4-byte float matrices can be performed by calling the sgemm (single precision general matrix
multiplication) routine, which implements the saxpy-like operation C = αAB + βC for matrices
A, B and C. The good news is that BLAS is available for CUDA code. The NVIDIA
cuBLAS library is a set of host callable routines that run BLAS functions on the GPU using
vectors and matrices in GPU memory. In fact, although cuBLAS provides its own routines to
allocate and transfer arrays between host and GPU memories it is also perfectly possible to use
thrust (or any other method) to manage these arrays. Thus cuBLAS can be used in our matrix
multiply example with just a few modifications. Example 2.18 shows how BLAS routines can
be used in host code to replace the kernel calls used in our previous examples.
. . .
05 #include "cublas_v2.h"
. . .
10 int main(int argc, char *argv[])
11 {
. . .
20 thrust::host_vector<float> A(Arow*Acol);
21 thrust::host_vector<float> B(Brow*Bcol);
22 thrust::host_vector<float> C(Crow*Ccol);
23 thrust::device_vector<float> dev_A(Arow*Acol);
24 thrust::device_vector<float> dev_B(Brow*Bcol);
25 thrust::device_vector<float> dev_C(Crow*Ccol);
26 thrust::device_vector<float> dev_D(Crow*Ccol);
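The host-to-device copies and the raw pointers required by the cuBLAS calls are not shown in this fragment; one possible way to provide them with thrust (a sketch, not necessarily the code used in the book's listing) is:

    dev_A = A;  dev_B = B;                                // thrust host-to-device copies
    float *dA = thrust::raw_pointer_cast(dev_A.data());  // raw device pointers that can be
    float *dB = thrust::raw_pointer_cast(dev_B.data());  // passed to the cuBLAS routines
    float *dC = thrust::raw_pointer_cast(dev_C.data());
    float *dD = thrust::raw_pointer_cast(dev_D.data());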
• Line 39: Because we are using a GPU equipped with tensor cores (i.e. having a CC of 7.0 or above)
we tell cuBLAS to use them where possible. Although originally realised as a tool for mixed
precision operations on 2- and 4-byte floats, recent versions of cuBLAS can also use tensor cores to
speed up pure 4-byte float calculations. In a test we see a speed-up of about 30% using 1024 × 1024
matrices and speed-ups of up to a factor of two using larger matrices.
• Lines 40–45: This is the timed loop where the matrix product is calculated. The BLAS library is
great for performance but the functions have a dreadful user interface essentially unchanged from
their early Fortran origins in the 1950s. Also, they keep to the Fortran convention of expecting
matrices in column major format (i.e. elements in the same column are stored in adjacent memory
locations). This means that default C/C++ style matrices, in row major format, are treated as if they
had been transposed. While this does not matter for simple operations such as addition, it does
matter for matrix multiplication. Fortunately, the matrix functions such as cublasSgemm (the
cuBLAS version of sgemm), used in line 41, have flag arguments specifying whether the input
matrices A and B should be transposed before use. This results in correct matrix multiplication but
the resulting matrix C is still left in column major format. We correct this in line 43 by calling the
cuBLAS function cublasSgeam to transpose C back to row major format.
• Line 41: The call to the cublasSgemm function has many arguments, as follows:
1. The mandatory cuBLAS handle
2. Transpose A if CUBLAS_OP_T or not if CUBLAS_OP_N
3. Transpose B if CUBLAS_OP_T or not if CUBLAS_OP_N
4. The number of rows of A (after transposition if done) and C; we use Crow here.
5. The number of columns of B (after transposition if done) and C; we use Ccol here.
6. The number of columns of A (after transposition if done) and rows of B (after transposition if
done); we use Arow here. This is the index that is summed in matrix multiplication.
7. Pointer to the scaling factor alpha.
8. Pointer to the matrix A.
9. Leading dimension of array used to hold A; we use Acol here.
10. Pointer to the matrix B.
11. Leading dimension of the array used to hold B; we use Bcol here.
12. Pointer to the scaling factor beta.
13. A pointer to the matrix C.
14. Leading dimension of the array used to store C.
For square matrices all the dimensions are the same and the interface is relatively forgiving; in other
cases significant care is required to get everything correct. We have allowed for the transposition of A
and B in our choice for argument 6 but not arguments 9 and 11. We have tacitly assumed that C is in
column major format for our choice of arguments 5, 6 and 14.
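Putting these arguments together, a hedged reconstruction of the line-41 call would look something like the following; the handle, alpha and beta variable names and the raw device pointers dA, dB and dC (obtained as in the earlier sketch) are our assumptions rather than the book's code:

    cublasSgemm(handle,
                CUBLAS_OP_T, CUBLAS_OP_T,   // arguments 2-3: transpose both row-major inputs
                Crow, Ccol, Arow,           // arguments 4-6: m, n and the summed index k
                &alpha,                     // argument 7
                dA, Acol,                   // arguments 8-9:  A and its leading dimension
                dB, Bcol,                   // arguments 10-11: B and its leading dimension
                &beta,                      // argument 12
                dC, Crow);                  // arguments 13-14: C (column major) and its leading dimension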
• Line 42: Set beta to zero before calling cublasSgeam.
• Line 43: Here we use the cublasSgeam function, which evaluates C = αA + βB, to transpose C.
This function is an NVIDIA extension to the standard set of BLAS functions. By setting α = 1 and
β = 0 we cause A to be copied to C with optional transposition of A if requested. The arguments for
cublasSgeam are as follows:
1–5. Same as cublasSgemm.
6. Pointer to alpha.
7. Pointer to the matrix A; we use C here.
8. Leading dimension of array used to hold first matrix; we use Crow here.
9. Pointer to beta. Note beta is set to zero in line 42.
10. Pointer to the matrix B; we use C here.
11. Leading dimension of array used to hold B matrix; we use Crow here.
12. Pointer to the matrix C; we use D here.
13. Leading dimension of array used to hold C matrix; we use Ccol here. (This would be Dcol in
cases where C and D had different sizes).
• Lines 44–52: These are similar to before.
The results at the end of the example show performances of 6.7 and 8.9 TFlops for the RTX 2070 and
1024 × 1024 matrices. The latter result was obtained using the tensor core processors available on
devices of CC ≥ 7.0.
The performance of the cublasSgemm is a factor of 6 or more better than our best
kernel. Moreover, tensor cores, if available, can be used to give further impressive speed-ups
of ~30% or more. Thus, while matrix multiplication is an excellent and much used calcula-
tion for demonstrating the use of shared memory in CUDA kernels, if you really need lots of
fast matrix multiplication, use the NVIDIA library rather than your own kernels. Similar advice
applies to other standard problems such as FFT for which NVIDIA also has a good library.
We discuss NVIDIA’s full range of libraries in Appendix F.
Figure 2.4 shows how the performance of our matrix multiply routines varies as a
function of matrix size. The curve labelled kernel corresponds to the gputiled1 kernel and the
curves labelled blas and blas+TC correspond to the two BLAS routines. The peak performance
achieved by the TC version of cuBLAS for the largest matrices is over 15 TFlops; this is an
astonishing performance from a £400 PC card.
In Chapter 11 we show you how to write your own matrix multiply kernels using tensor
cores; the shared memory version achieves about 5.6 TFlops compared to the 8.9 achieved
by cuBLAS. This is actually not bad as the cuBLAS library routines will contain many
detailed optimisations. The performance of this kernel is shown as the chap11 curve on
Figure 2.4.
The fact that fast libraries for standard calculations like matrix multiplication are available
does not mean that learning to write your own kernels in CUDA is unnecessary; there are
many situations where an out of the box solution is not readily available. For example,
simulation is very important in many areas. Models of your particular problem often have to
be hand-crafted and could gain enormously in speed if you are able to include code that
exploits the raw power of GPUs. We also note from Figure 2.4 that our matrix multiply
kernels are more competitive with cuBLAS for smaller matrices which might be useful when
many small matrix multiplications are required as part of a bigger program.
This concludes our second introductory chapter. In the next chapter we discuss warp level
programming which will provide important insights into how to get the best performance
from your GPU. The examples in Chapter 3 include further improvements to the
reduce kernels.
Endnotes Chapter 2
1 Flynn, Michael J. “Some computer organizations and their effectiveness.” IEEE transactions on com-
puters 100, no. 9 (1972): 948–960.
2 In practice the addition of floating-point numbers is not precisely commutative because the accumula-
tion of rounding errors can depend on the order in which the terms are added together. Once the sum
gets large, the contribution from subsequent small numbers is inaccurate or completely lost. This is a
particular issue for F32 where only about 7 significant figures are useful. Interestingly, parallel reduction
techniques, where a number of partial sums are accumulated in parallel, are likely to be more robust than
a single serial evaluation.
3 I first encountered this approach to parallel programming when learning MPI in the mid-1990s and it
was a revelation. My previous encounters with trying to program multiple devices to run in parallel had
involved writing different programs for each device and hand tuning at the assembly level to make the
execution times identical on each device (MIMD) – a nightmare task compared to the common code
SIMD model of CUDA and MPI.
4 Specifically, the member functions of the host and device vector classes do not have __device__
definitions. Thrust was designed as a suite of host callable functions which ran on the GPU for speed.
Users of thrust were not expected to write their own kernels.
5 Our recommendation that threads should be a multiple of 32 is for performance reasons. Any value in
[1,1024] is allowed; it is just that values which are multiples of the warp size are more efficient. For
example, if you specified 48 then every thread block would be run with one full warp of 32 threads and
one ½ full warp of 16 threads leading to a 25% performance loss.
6 If you want your compiled code to run on different GPU models you can use the device query functions
in CUDA to find the value of Nsm at run time.
7 As a technical aside, we mention that the GPU hardware manages branch divergence at the warp-engine
level by maintaining a 32-bit active-thread bit mask for each active warp; the bits are turned on or off to
determine which threads execute the currently scheduled instruction.
8 The Cooperative Groups feature does allow grid-wide synchronisation of all threads in a grid during
kernel execution but only if a number of restrictions are applied including having all thread blocks
resident on the device at once.
9 The modern C++ practice of declaring and initialising objects in the same statement (RAII) cannot be
applied to shared memory objects in CUDA kernels because the declaration is the same for all threads in
the kernel, but the initialisation is usually thread dependent.
10 Actually, it should be second nature for you as a C++ programmer to always use longest first/shortest
last ordering in variable declarations for all your classes and structs as well as special cases like CUDA
dynamically declared arrays. This will achieve natural alignment for all your variables without the
compiler having to insert “hidden” padding.
11 Giving up on containers for kernel arguments means that we have to pass array dimensions explicitly as
separate arguments. This is a genuine loss and is a potential source of bugs. One advantage of containers
is that objects know their sizes.
12 Of course, if you use restrict it is your responsibility to ensure that aliasing does not occur – the
compiler still cannot actually check this.
3 Warps and Cooperative Groups
The 32-thread warp has been a stable feature of NVIDIA GPU architecture since the early Fermi
generation.1 The GPU hardware processes all threads in a warp together using a dedicated subunit
of a GPU SM.2 From the Fermi to Pascal GPU generations, each warp was processed using a
single program counter and a 32-bit mask indicating which threads were active for the current
instruction. This meant that all active threads within a warp ran in strict lock step with each other
and thus were always implicitly synchronised. As a consequence of this design, explicit calls to
__syncthreads() in kernel code were only required to synchronise threads in different
warps but not for threads within the same warp. It was actually worthwhile to omit
__syncthreads() calls in kernel code when possible, because these calls, which synchron-
ise all the threads in the thread block, are expensive – a great deal of code used this implicit
intra-warp synchronisation and you may still see this trick used in older tutorial example code.
In 2017 NVIDIA released CUDA SDK version 9.0 with support for the Volta (CC = 7.0)
and Turing (CC = 7.5) generations of GPU. These new architectures break the implicit warp
synchronisation paradigm because their hardware has separate per thread program counters
which means that on these devices the threads within the same warp might not run in lock
step. Thus, historic code relying on implicit warp synchronisation might not run correctly on
the new GPUs.3 CUDA 9.0 also introduced cooperative groups as a powerful way of making
explicit any assumptions of synchronisation between the threads in a warp and generalising
these ideas to other sizes of thread groups. The introduction of cooperative groups promotes
warps and thread blocks to first-class C++ objects, allowing programmers to write straightforward
code that makes it clear where warp-level ideas are being used.
The May 2020 release of CUDA SDK 11.0 contained a significant improvement to
cooperative groups in that grid level objects no longer need special treatment at compile time
and work with all CUDA capable cards. Our examples are based on that release. As explained
below the previous restrictions continue to apply if the grid.sync member function is used.
To illustrate cooperative groups, we will revisit the reduce kernels of Chapter 2. Example 3.1
shows a modified version of the reduce4 kernel from Chapter 2 which is safe for devices of
CC ≥ 7.0. This example uses the new __syncwarp() function to explicitly synchronise
threads in a warp whenever implicit warp synchronisation would have been used in the past.
launched with the number of threads per block set equal to blockSize. That means that the built-in
parameter blockDim.x will equal blockSize during kernel execution. The advantage of this is
that the NVCC compiler will delete portions of lines 21–24 at compile time keeping only those parts
which are necessary. This is a common trick in CUDA programming.
This kernel processes integer data because the types of the arguments data and sums are both set
to int, but the kernel can easily be modified to process other arithmetic types, and indeed we could
have simply used a second template parameter to generalise the kernel. However, that is not done here
to avoid clutter in the example.
• Line 15: This kernel uses statically allocated shared memory of one word per thread. The partial
thread sums are stored here.
• Lines 16–17: The variable id is set to the thread’s rank in its block using threadIdx.x and the
thread’s element of shared memory s[id] is set to zero.
• Lines 18–19: Here we accumulate the partial per thread sums in s[id]. In this loop each thread
uses a stride equal to the total number of threads to pick elements from the input array data.
• Line 19: Here __syncthreads is used so that all the partial sums will be valid beyond this
statement. This statement is necessary.
• Lines 20–23: Here we perform the tree sums necessary for larger block sizes. Notice that at compile
time the compiler will completely remove statements that fail the blockSize test and remove that
part of the test from the statements for which it is true.
• Line 24: For the final stages of the tree sum only threads with id < 32 participate and as these
threads are all in the lowest ranked warp of the thread block, we wrap a single if statement around
the final lines 26–33 of the kernel. This is warp-level programming and allows threads in other
warps to exit the kernel early.
• Lines 25–30: These are the final steps in the tree sum reduction process. Note the use of
__syncwarp() in these statements. The __syncwarp() is part of the warp-level programming
support introduced in CUDA version 9.0. A __syncwarp() call is cheaper to execute than
__syncthreads() and in fact does nothing on architectures with CC < 7 because for those
GPUs all threads in a warp always run in lockstep.
• Line 31: Here the thread with id = 0 stores the computed block-wide sum in an element of the
output array sums.
Example 3.1 reduce5 kernel using syncwarp for device of CC=7 and higher
18 __syncthreads();
19 if(blockSize > 512 && id < 512 && id+512 < blockSize)
s[id] += s[id+512];
20 __syncthreads();
21 if(blockSize > 256 && id < 256 && id+256 < blockSize)
s[id] += s[id+256];
22 __syncthreads();
23 if(blockSize > 128 && id < 128 && id+128 < blockSize)
s[id] += s[id+128];
24 __syncthreads();
25 if(blockSize > 64 && id < 64 && id+ 64 < blockSize)
s[id] += s[id+64];
26 __syncthreads();
27 if (id < 32) {
// syncwarps required for devices of CC >= 7.0
28 s[id] += s[id + 32]; __syncwarp();
29 if(id < 16) s[id] += s[id + 16]; __syncwarp();
30 if(id < 8) s[id] += s[id + 8]; __syncwarp();
31 if(id < 4) s[id] += s[id + 4]; __syncwarp();
32 if(id < 2) s[id] += s[id + 2]; __syncwarp();
33 if(id < 1) s[id] += s[id + 1]; __syncwarp();
In many tutorial reduction examples predating the introduction of the CC=7.0 architec-
ture, lines 25–30 would have been just the addition statements without either the if clauses or
the synchronisation calls. This was possible because all the threads in the warp ran in strict
lock step and thus there was no need for explicit synchronisation and also allowing all
threads to perform all the additions did not add to the execution time. Moreover, removing
the if clauses improved performance without affecting the correctness of the final sum.
(Although the final values in s[1–31] will have been corrupted, after they had been
correctly used, by subsequent “unnecessary” addition steps.)
In Example 3.1 we have used the relatively fast __syncwarp() calls to ensure the
correct behaviour on newer devices. More subtly, the if statements in lines 26–30 are also
necessary for CC≥7 correctness. The reason for this is to avoid possible “read after write”
errors; consider the statement s[id] += s[id+8] which is part of line 27; if it is made
conditional on (id < 8) then only s[0–7] will be changed by adding s[8–15]. On the
other hand, if the addition is made unconditional and threads do not run in strict lock step
then one or more values in s[8-15] may have values from s[16–23] added before they
are added to s[0–7] resulting in an incorrect final total. This is sometimes referred to as a
“read after write error”.
3.1 CUDA Objects in Cooperative Groups
#include "cooperative_groups.h"
namespace cg = cooperative_groups;
Here the optional second line defines cg as a short alias for the namespace cooperative_
groups. These definitions allow us to create first-class C++ objects to represent the familiar
thread block, grid and warp objects defined for any kernel launch. An example is shown in the
next box:
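The contents of that box are not reproduced in this extract; based on the description that follows (and on Example 3.4), the declarations are of this form:

    auto grid  = cg::this_grid();                 // the whole grid of thread blocks
    auto block = cg::this_thread_block();         // this thread's thread block
    auto warp  = cg::tiled_partition<32>(block);  // this thread's 32-thread warp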
If you don’t like auto, the explicit types of these objects are:
cg::grid_group grid
cg::thread_block block
cg::thread_block_tile<32> warp
These statements create objects grid, block and warp which represent respectively the
grid, thread block and warp to which the particular thread belongs. Notice we are using the
built-in objects this_grid() and this_thread_block() to initialise grid and
block but we use the explicit value of 32 for the warp size in the definition of warp.4
The use of the C++11 keyword auto in the declarations is helpful for avoiding mistakes
and keeping the code readable. These objects are first-class C++ objects and can, for
example, be passed as arguments to device functions. There are no default constructors for
these objects, and both grid and block must be initialised in the way shown. You can, of
course, choose different variable names and copy these objects. Within your kernel code the
actual objects are lightweight handles pointing to data shared by all relevant threads.
The block and grid objects encapsulate the familiar properties of the thread block and grid
of thread blocks associated with any CUDA kernel launch and in particular provide an
alternative way of accessing the same information contained in the CUDA built-in variables
like threadIdx.5
Tiled partitions are more complicated; the example shown above divides the previously
created thread block block into sub-blocks of 32 contiguous threads representing the
individual warps that are used by the hardware. This is by far the most common use of tiled
partitions, but it is not the only possibility. Partitions can have sizes which are any power of
2 between 2 and 32 and can be defined on the thread block or on a previously defined larger
tiled partition as shown in the next box:
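The box itself is missing from this extract; a sketch of the kind of declarations it describes (the names follow the warp8, tile8 and tile4 objects discussed with Example 3.3 below, and are our choice) is:

    auto warp8 = cg::tiled_partition<8>(block);   // 8-thread tiles of the thread block
    auto tile8 = cg::tiled_partition<8>(warp);    // 8-thread tiles of a 32-thread warp
    auto tile4 = cg::tiled_partition<4>(tile8);   // 4-thread tiles of an 8-thread tile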
Since CUDA SDK 11.0 all the other features of grid groups can be used in normal CUDA code. The
is_valid() member function allows a kernel to test whether or not the grid.sync()
function is available. Prior to SDK 11.0 in May 2020 none of the grid group functionality was
available without these restrictions and so many early tutorial examples may not make use of
grid groups.
The function block.thread_rank() returns the rank of the current thread in its
thread block, using the same 3D rank formula as used in Example 2.3 which reduces to the
1D linear thread rank formula in cases where a 1D thread block is used for the kernel launch.
The grid.thread_rank() function adds block_rank*block_size to this result
to obtain the rank of a thread within the entire grid of thread blocks. Example 3.2 shows how
Example 2.3 can be rewritten using cooperative groups.
• Lines 4.1–4.2: These lines add support for cooperative groups to the code.
• Lines 9–11: These perform the same calculation as before of the thread's 3D rank in the grid but
using member functions instead of the corresponding built-in variables. We have still had to
remember the correct formulae with arguably a small increase in verbosity.
• Line 14: The member function block.size() replaces the previous formula for the thread block size.
• Line 15: There is no member function for the number of blocks in a grid: the ratio of the number of
threads in the grid and threads in a block is a simple way to find this quantity.
• Line 16: The member function grid.size() gives the total number of threads in the grid directly.
• Line 17: The member function block.thread_rank() gives the current thread’s rank in the
thread block directly. This is a frequently used quantity in kernel code and we have eliminated the
formula needed before.
• Line 18: Here we use a simple ratio to calculate the thread block’s rank in the grid replacing a more
complex formula.
• Line 19: The member function grid.thread_rank() gives the current thread’s rank in the grid
directly. Again, this is a frequently used quantity and we have eliminated the formula needed before.
Example 3.2 coop3D kernel illustrating use of cooperative groups with 3D grids
. . .
04.1 #include "cooperative_groups.h"
04.2 namespace cg = cooperative_groups;
05 __device__ int a[256][512][512]; // file scope
06 __device__ float b[256][512][512]; // file scope
07 __global__ void coop3D(int nx,int ny,int nz,int id)
08 {
08.1 auto grid = cg::this_grid();
08.2 auto block = cg::this_thread_block();
09 int x = block.group_index().x*block.group_dim().x+
block.thread_index().x;
10 int y = block.group_index().y*block.group_dim().y+
block.thread_index().y;
11 int z = block.group_index().z*block.group_dim().z+
block.thread_index().z;
12 if(x >=nx || y >=ny || z >=nz) return; // in range?
13 int array_size = nx*ny*nz;
// threads in one block
14 int block_size = block.size();
// blocks in grid
15 int grid_size = grid.size()/block.size();
// threads in whole grid
16 int total_threads = grid.size();
17 int thread_rank_in_block = block.thread_rank();
18 int block_rank_in_grid =
grid.thread_rank()/block.size();
19 int thread_rank_in_grid = grid.thread_rank();
. . .
Our conclusion from Example 3.2 might be that although there has been some reduction in
the need to use error prone formulae there are no really new features so far.
The really interesting new feature is the treatment of warps which we will discuss in two
steps. Firstly, in Example 3.3 we show how warps and sub-warps can be defined as C++
objects and then we move on to discuss how the additional member functions in Table 3.2
can be used to significantly improve our reduction kernels.
. . .
08 template <int T> __device__ void
show_tile(const char *tag, cg::thread_block_tile<T> p)
09 {
10 int rank = p.thread_rank(); // thread rank in tile
11 int size = p.size(); // number of threads in tile
12 int mrank = p.meta_group_rank(); // rank of tile in parent
// number of tiles in parent
13 int msize = p.meta_group_size();
• Line 24: This defines what is effectively a sub-sub-tile by defining tile4 as a partition of tile8.
• Lines 25–32: Here we print results for the single thread having rank id in the grid of thread blocks.
The input parameter id is a user supplied value.
• Lines 34–41: Here the main routine sets id and the launch configuration for optional user input and
launches the cgwarp kernel.
The sample results shown for Example 3.3 represent different views of the same physical
thread regarded as a member of differently defined partitions. The difference between warp8
and tile8 is only seen in their “meta” properties with respect to different parents.
Table 3.2 Additional member functions for tiled thread blocks
Additional member functions for thread_block_tile objects. The template parameter T for the
shuffle functions can be int, long, long long (signed or unsigned) and float or double. With
the cuda_fp16.h header included T can also be __half or __half2. These functions implicitly
perform all necessary synchronisations with no risk of read after write errors. The shfl functions
return 0 when referencing values for threads which have exited early. The match_any and
match_all functions are only available for devices with CC ≥7.0.
Member Function                      Comment (tid is the lane of the calling thread)
T shfl(T var, int src)               Returns the value of var held in lane src to all threads in the tile.
T shfl_down(T var, uint sft)         Returns the value of var in lane tid+sft to thread tid. If tid+sft is out of
                                     range, the thread's own value of var is returned.
T shfl_up(T var, uint sft)           Returns the value of var in lane tid-sft to thread tid. If tid-sft is out of
                                     range, the thread's own value of var is returned.
T shfl_xor(T var, uint msk)          Returns the value of var in lane (tid XOR msk) to thread tid. For example,
                                     setting msk to size()-1, or simply -1, reverses the order in which the
                                     values are held across the threads.
int any(int pred)                    Returns 1 if pred is non-zero for any thread in the tile, otherwise returns 0.
int all(int pred)                    Returns 1 if pred is non-zero for all threads in the tile, otherwise returns 0.
uint ballot(int pred)                Returns a bitmask with bits set to 1 for active threads with non-zero values
                                     of pred.
uint match_any(T val)                Returns a bitmask with bits set to 1 for active threads having the same value
                                     of val as the calling thread.
uint match_all(T val, int &pred)     If all active threads have the same value of val, returns a bitmask of the
                                     active threads and sets pred to 1; otherwise 0 is returned and pred is set to 0.
05 #include "cooperative_groups.h"
06 namespace cg = cooperative_groups;
. . .
10 template <int blockSize> __global__ void
reduce6(r_Ptr<float> sums,cr_Ptr<float> data,int n)
11 {
12 // This template kernel assumes blockDim.x = blockSize
13 // and that blockSize ≤ 1024
14 __shared__ float s[blockSize];
15 auto grid = cg::this_grid(); // cg definitions
16 auto block = cg::this_thread_block(); // for launch
17 auto warp = cg::tiled_partition<32>(block); // config
18 int id = block.thread_rank(); // rank in block
19 s[id] = 0.0f; // NB simplified thread linear addressing loop
20 for(int tid=grid.thread_rank();tid < n;
tid+=grid.size()) s[id] += data[tid];
21 block.sync();
22 if(blockSize>512 && id<512 && id+512<blockSize)
s[id] += s[id + 512];
23 block.sync();
24 if(blockSize>256 && id<256 && id+256<blockSize)
s[id] += s[id + 256];
25 block.sync();
26 if(blockSize>128 && id<128 && id+128<blockSize)
s[id] += s[id + 128];
27 block.sync();
28 if(blockSize>64 && id<64 && id+64 < blockSize)
s[id] += s[id + 64];
29 block.sync();
30 // just warp zero from here
31 if(warp.meta_group_rank()==0) {
32 s[id] += s[id + 32]; warp.sync();
33 s[id] += warp.shfl_down(s[id],16);
34 s[id] += warp.shfl_down(s[id], 8);
35 s[id] += warp.shfl_down(s[id], 4);
36 s[id] += warp.shfl_down(s[id], 2);
• Lines 5–6: Here we show the inclusion of the header file needed to access cooperative groups and
the definition of cg as a shorthand for the cooperative group’s namespace.
• Line 10: The new kernel reduce6 is declared here and has the same interface as reduce5.
• Lines 15–17: These additional lines define objects representing the current thread’s warp, thread
block and also the entire thread-grid.
• Line 18: Equivalent to line 15 of the previous kernel except that we use block.thread_rank()
instead of threadIdx.x to find the rank of the current thread in the thread block.
• Line 20: This line replaces line 17 of reduce5 and computes the thread based partial sums of the n
values stored in the array data using thread-linear addressing. Notice how the expressions for the
for loop start and end values are simplified using the grid object.6
• Lines 21–29: These are identical to lines 19–26 of reduce5 except that we use block.sync()
instead of __syncthreads().
• Lines 31–39: This is the code for the last part of the reduction performed by the lowest ranking warp
in the thread block; it is equivalent to lines 27–35 of reduce5. Notice the warp size of 32 threads is
still hard-wired into this part of the code; we could have used warp.size() instead of 32 here and
replaced lines 29–33 with a for loop, but we prefer to use this explicitly unrolled version for
clarity and performance.
○ Line 32: We use warp.sync() instead of __syncwarp() here. Since this code reads data
stored in shared memory by other threads in the warp, synchronisation is still required at this point.
○ Lines 33–37: Here we use the shfl_down member function of warp to exchange local values
between threads in the same warp. This
member function will implicitly perform the necessary intra-warp synchronisation between
threads including protection from read after write errors. Thus, we can dispense with both the
if(id<..) clauses and the __syncwarp() statements used in lines 28–33 of the previous
example. In fact, not only can we dispense with them – we have to! The shfl_up, _down and
_xor instructions shown in Table 3.3 only work if both the sender and receiver threads in the
warp execute the instruction. If the sender thread is inactive or exited, the result obtained by the
receiver thread is documented as undefined. More details are in Table 3.7 in our section on
thread divergence.
In lines 32–37 we have recovered the simplicity and efficiency of older codes with CC<7.0 using
lock-step implicit warp-level programming – but with the advantage of making our intentions explicit!
• Line 38: This final line which stores the block sum is identical to line 34 of the previous example.
Looking at the warp-level processing in Example 3.4 we notice that while the shfl_down
function can use shared memory to hold the value being exchanged, it does not need to; a local
register will work just as well and could be slightly faster. This motivates us to think about
dispensing with shared memory entirely and writing a third version of the reduction kernel,
reduce7, shown in Example 3.5; it just uses warp-level code and local registers.
• Line 10: The declaration of reduce7 is similar to before except that this is not a template function.
The reason for this is that this kernel version works for any thread block size (which is a multiple of
32). All tree reductions will be done in five steps by individual warps.
• Lines 14–16: These define our usual CG objects; notice that there is no preceding shared
memory declaration.
• Lines 17–18: Here each thread in the kernel accumulates a sum over a subset of the input data. This
corresponds to lines 19–20 of Example 3.4. However, a local register-based variable v is used instead
of the shared s array indexed by id. Dispensing with shared memory simplifies both the code and
the kernel launch; it also allows us to configure the GPU to use maximum L1 cache and minimum
shared memory on devices where this is possible.
• Line 19: Here we use warp.sync() instead of __syncthreads() as was needed in the
previous versions. Again, this is slightly faster.
• Lines 20–24: This is a warp-level tree reduction of the 32 local v values held by the threads in
each warp. Here we see the real power of the warp-level member functions to share data
between threads in the same warp. These lines correspond to lines 33–37 of the previous example.
• Line 25: At this point for each warp, the thread of rank zero in the warp holds that warp’s
contribution to the total sum. In order to retain the same output interface as the previous version
we would like to store just the sum of these individual contributions in a single element of the
output array sums. We solve the problem of accumulating multiple values in a single memory
location by using atomic addition to update the element of global memory. This is potentially an
expensive operation and also requires that the elements of sums have been preset to zero prior to
the kernel call. For GPUs of CC=6.0 and above, we could use the new atomicAdd_block
function which restricts atomic operation to just threads in the same thread block. This newer
version should be faster than the older atomicAdd which checks all threads in the kernel.
However, tests with the RTX 2070 show very little observable performance difference between
these two functions.
Notice also that we used warp.thread_rank() == 0 instead of something like id%32 == 0
to find the lowest ranking thread in a warp. Again, this makes the programmer's intentions clearer.
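The reduce7 listing itself (Example 3.5) is not reproduced in this extract; the following sketch is reconstructed from the notes above, so details such as the exact line layout are our assumptions:

    __global__ void reduce7(r_Ptr<float> sums, cr_Ptr<float> data, int n)
    {
        // assumes blockDim.x is a multiple of 32 and that sums is preset to zero
        auto grid  = cg::this_grid();
        auto block = cg::this_thread_block();
        auto warp  = cg::tiled_partition<32>(block);

        float v = 0.0f;                       // register-based partial sum
        for (int tid = grid.thread_rank(); tid < n; tid += grid.size()) v += data[tid];
        warp.sync();

        v += warp.shfl_down(v, 16);           // five-step warp-level tree reduction
        v += warp.shfl_down(v,  8);
        v += warp.shfl_down(v,  4);
        v += warp.shfl_down(v,  2);
        v += warp.shfl_down(v,  1);

        if (warp.thread_rank() == 0)          // one atomic add per warp
            atomicAdd(&sums[block.group_index().x], v);
    }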
The reduce7 example is not only more compact than the previous version, but it also
works for any thread block size that is a multiple of the warp size without the need for
template parameters or other tests.
Another interesting point to note about reduce7 is that, while we used the
shfl_down intrinsic function, we could equally well have used shfl_xor with the same
values for the second argument; these values would be interpreted as lane bitmasks for the XOR
operations rather than as shift-down values. The effect of using these XORs would be to exchange
corresponding pairs of values rather than just shift down or return local values when out of range.
This in turn would, after five steps, leave all the 32 v values in the warp containing the same
correct sum, not just thread zero. This takes no extra time and could be helpful in other
applications where the sums are part of an ongoing calculation and needed by all threads in
the warp.
The topic of reduction has featured in CUDA tutorials from the beginning, mostly as a
way of illustrating shared memory. As the final reduce7 example shows, warp-only pro-
gramming on more recent GPUs can deliver the same or better performance without using
shared memory. The May 2020 CUDA SDK 11, which introduced the new CC 8 Ampere
generation of GPUs, takes this one step further by adding a new warp-level reduce function
to the cooperative groups library.8 This function can replace lines 19–23 of the reduce7
kernel. This is shown in our final reduce example reduce8.
05 #include "cooperative_groups/reduce.h"
. . .
10 __global__ void reduce8(r_Ptr<float> sums, cr_Ptr<float>
data, int n)
11 {
12 // This kernel assumes array sums is set to zero
13 // on entry and that blockSize is multiple of 32
14 auto grid = cg::this_grid();
15 auto block = cg::this_thread_block();
16 auto warp = cg::tiled_partition<32>(block);
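The remainder of the reduce8 kernel is not shown in this extract; a minimal sketch of how the new cg::reduce function could complete it (our reconstruction, based on the description above and on reduce7) is:

    float v = 0.0f;                               // per-thread partial sum
    for (int tid = grid.thread_rank(); tid < n; tid += grid.size()) v += data[tid];
    warp.sync();
    v = cg::reduce(warp, v, cg::plus<float>());   // single-call warp-level tree sum
    if (warp.thread_rank() == 0) atomicAdd(&sums[block.group_index().x], v);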
Our reduce kernels all work in two steps: each thread first accumulates a partial sum over a subset of
the input array and then these partial sums are combined using a tree reduction. The time for the first
step grows with the size of the input array, but the time for the second step is independent of the size
of the input array. Thus, for large input
array sizes the time taken for the kernel to execute is dominated by the first step – but our
efforts so far have been devoted to optimising the second step.
We note that in the first step each thread reads a single 32-bit word from the input array
data on each pass through the read loop (e.g. line 17 of Examples 3.5 and 3.6). The reads
are properly coalesced in the sense that consecutive threads read consecutive words from the
input array, nevertheless we can improve performance by switching to reading 128-bit items
per thread; this is maximally efficient in terms of L1 cache use and allows the compiler to use
one 128-bit load and store instruction rather than four 32-bit instructions. This technique is
called vector-loading and is discussed in the NVIDIA blog https://devblogs.nvidia.com/
cuda-pro-tip-increase-performance-with-vectorized-memory-access/.
The modified version of the kernel reduce7_vl is shown in Example 3.7.
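The listing of reduce7_vl is not reproduced here; a sketch of its vector-loading read loop (our reconstruction, assuming n is a multiple of 4 and that data comes from a 16-byte aligned device allocation) is:

    const float4 *data4 = reinterpret_cast<const float4 *>(data);  // 128-bit view of data
    float4 v4 = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int tid = grid.thread_rank(); tid < n/4; tid += grid.size()) {
        float4 d = data4[tid];                    // one 128-bit load per pass
        v4.x += d.x; v4.y += d.y; v4.z += d.z; v4.w += d.w;
    }
    float v = v4.x + v4.y + v4.z + v4.w;          // collapse to a scalar ("line 19" below)
    // the warp-level tree reduction then proceeds exactly as in reduce7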
• Line 19: Here we sum the four components of v4 and store the result in the scalar variable v. The
scalar v now plays the same role as it did in the reduce7 example. As a subtle detail we note that
different subsets of values from data will be summed into the v variables in the two versions of this
kernel. This could conceivably affect rounding errors.
• Lines 20–26: These are identical to lines 19–25 of reduce7.
The other kernels can be converted to use 128-bit vector loads in exactly the same way;
we will not show the code here but the modified kernels are available in our code repository.
We have run timing tests on the various reduce kernels with and without vector loading for
input array sizes of powers of 2 between 2^15 and 2^29. Since reading global memory is the
performance limiting factor, the appropriate metric is the bandwidth in GB/sec calculated as:

$$\text{bandwidth in GB/sec} = \frac{\text{input array size in bytes}}{10^9 \times \text{time taken in seconds}}.$$
For comparison we also include the result of using the thrust::reduce library
function on data already stored in device memory. The results are shown in Figures 3.1
and 3.2 for the RTX 2070 GPU.
Getting good timing information for these fast memory bound kernels is actually quite
hard. The results shown here are based on the average times taken to run the kernels in a loop
for between 1024 and 65536 iterations using multiple copies of the data buffer data
spanning 2^29 4-byte words of global memory and rotating through different copies of data
on successive iterations in order to defeat memory caching. Without this precaution the
execution times for smaller data sizes just reflect the time required to reload data from L1 or
L2 cache. We also ran each kernel for several different launch configurations and in each
case used the fastest time obtained. The best launch configuration was found to vary between
kernels and data size, smaller data sizes required fewer blocks and threads.
Figure 3.1 Performance of the reduction kernels on a Turing RTX 2070 GPU
It is clear from Figure 3.2 that by far the most important optimisation is switching to
vector loading. Beyond that the reduce7/8 kernels have the best performance but the
improvements over reduce6 are quite small for larger datasets.
In order to see the difference more clearly Figure 3.3 shows the fractional bandwidth
differences between reduce8 and reduce5/6/7 and between reduce8_vl and
reduce5/6/7_vl. The differences are calculated as 100(reduce5-reduce8)/
reduce8 for the case of reduce5 and similarly for the other kernels. From the figure
we can see that the reduce7 versions are better than either reduce5 or reduce6, but
these differences become less significant for larger data sets; presumably this is because for
large arrays memory access speed dominates the calculation time. The reduce7 kernel’s
performance is essentially the same as that of the new CUDA reduce function used in
reduce8. These results are for our CC 7.5 RTX 2070 Turing GPU. The next generation
CC 8 Ampere GPUs have hardware support for warp-level reduce so we might expect
reduce8 to perform better on these devices.
This ends our discussion of reduction operations on GPUs and the remainder of this
chapter explores further features of cooperative groups associated with partitioning warps
into subsets. This part can be skipped on first reading as you are unlikely to want these more
esoteric features of cooperative groups at first. The examples in the rest of this chapter are
frankly somewhat artificial, but we think you will find that the more application orientated
examples in the rest of the book are more rewarding.
modern CC≥7 devices. The match_all and match_any functions are new with CC=7.0
and only work on devices with CC≥7.0.
The warp-level intrinsic functions shown in Tables 3.3 and 3.4 are presented mainly
for reasons of completeness and because you may well see them in older tutorial examples or
existing code. Although they can be used without creating CG objects, we think they are
harder to use than the member functions shown in Table 3.2 because of the need to specify
the mask argument for participating threads. We also think that creating tiled_partitions
to represent warps is always a good idea for warp-level CUDA code irrespective of
whether you need to use the member functions.
Another reason why Tables 3.3 and 3.4 are interesting is that the member functions for
tiled_partitions used in our examples are presumably implemented by calling these
older functions with suitably constructed mask and width values.10
Note that T can be any of the 32 or 64-bit integer types, or float or double. If the cuda_fp16.h
header is included then __half and __half2 types are also allowed. The shfl functions return
0 when referencing values for threads which have exited early.
T __shfl_sync(uint mask, T val, int src, int width=warpSize)
For each non-exited thread in mask returns the value of val held by the thread whose ID is src. If
thread src is inactive the results are undefined. This is a broadcast operation. If width is less than
warpSize then each subsection of the warp behaves as a separate sub-warp with lanes in the range
[0,width-1]. Also src%width is used for the source lane, thus broadcasts are confined to sub-
warps and each sub-warp might use a different src value. Width must be a power of 2≤warpSize.
T __shfl_up_sync(uint mask, T val, uint delta, int width=warpSize)
The same as shfl_sync except that the fixed src lane is replaced by the calling thread’s lane minus
delta. For example if delta=3, a thread in lane 15 would receive the value from lane 12. If the
subtraction results in a negative lane value, the function returns the calling thread’s own value of val.
Effectively, values are shifted up delta positions within the warp or sub-warp.
T __shfl_down_sync(uint mask, T val, uint delta, int width=warpSize)
The same as shfl_sync except that the fixed src lane is replaced by the calling thread’s lane plus
delta. For example for delta=3, a thread in lane 15 would receive the value from lane 18. If the
addition results in a value > width-1, the function returns the calling thread's own value of val.
Effectively, values are shifted down delta positions within the warp or sub-warp.
T __shfl_xor_sync(uint mask, T val, uint laneMask, int width=warpSize)
The same as shfl_sync except that the fixed src lane is replaced by the logical XOR of calling
thread’s lane with laneMask. For example for laneMask=0x1F and width = 32, the values
received by the calling threads would be in reverse order to the calling thread’s rank in the warp. On
present hardware the least 5 significant bits from the result of the XOR are used. Thus, if width is
less than warpSize the result of the XOR may be greater than the size of a sub-warp. In this situation
threads will not be able to access values in higher ranked sub-warps than their own and return their
own value of val but they will be able to access threads in lower ranked sub-warps than their own.
threads are active. In other cases we might know that in certain parts of our code some
threads are or might be inactive or exited.
It is important to distinguish between exited and inactive threads; most CUDA functions
deal with exited threads gracefully but may have problems with inactive threads.11 For
example, __syncwarp() might deadlock if there are inactive threads while exited threads
are ignored. Inactive threads are a big potential problem if we are writing a general-purpose
device function. The function might be called by different kernels, and it should not make
assumptions about which threads are active when the function is called. In particular, inactive
threads cause problems when we use __syncthreads(), __syncwarp() or any of the warp
shuffle functions. A related issue for the shuffle functions is out of range target threads, for
example, in the call w.shfl_down(v,16) threads 16–31 in the warp will be out of range
whereas threads 0–15 will not. Tables 3.5 and 3.6 summarise the possibilities and outcomes.
NB The non-member warp shuffle functions take a bit mask specifying which
threads will participate in the exchange. Only those threads are used for range
calculations. Thus if threads 7–10 are excluded by the bitmask, thread 6 will
receive its value from thread 11 using shfl_down with a shift of 1.
In Table 3.5 it is important to note the different return values from active and inactive or exited
threads. In a case like our reduction, Example 3.4, where we use addition of the return values from
shfl_down, a return value of zero from an inactive thread might not be a problem but if we
were to use multiplication instead, an unexpected return value of zero would be catastrophic. If
it is known which threads are active then the bitmask used with the intrinsic warp shuffle
functions can be set to exclude inactive threads avoiding the problem of undefined return values
from inactive threads. In particular, note that the shifts used with shfl_up or shfl_down only
use threads included in the bit mask for determining target threads. For example, if threads 7–10
are excluded from the bitmask, thread 6 would receive the value from thread 11 when using
shfl_down with a shift of 1. Although you could work directly with the intrinsic functions and
manually set appropriate masks, it is much better to use the coalesced groups discussed below.
In thread divergent code there is a risk that not all threads will reach a particular __syncthreads() call in the code. This is a
particular risk when making conditional calls to __device__ functions that include calls to
__syncthreads().
Actually, the CUDA C programming guides for SDK version 9.0 and later explain that
pre-Volta hardware actually implemented a weaker form of __syncthreads(), to quote
from the guide:
“Although __syncthreads() has been consistently documented as synchronizing all
threads in the thread-block, Pascal and prior architectures could only enforce synchron-
ization at the warp level. In certain cases, this allowed a barrier to succeed without being
executed by every thread as long as at least some thread in every warp reached the
barrier. Starting with Volta, the CUDA built-in __syncthreads() and PTX instruc-
tion bar.sync (and their derivatives) are enforced per thread and thus will not succeed
until reached by all non-exited threads in the block. Code exploiting the previous
behaviour will likely deadlock and must be modified to ensure that all non-exited threads
reach the barrier.”
Example 3.8 shows a kernel, deadlock, designed to demonstrate deadlock issues with
divergent threads. The kernel partitions the threads in each thread block into three groups using
a user settable parameter gsync. Group A includes threads with rank in [0,gsync-1],
group B threads with ranks in [gsync,gsync*2-1] and group C threads with rank ≥ gsync*2.
The idea is that the group A threads wait at one __syncthreads() and group
B wait at a different __syncthreads(). For CC<7 hardware this is insufficient to cause
deadlock – we need a third group of threads that calls neither of the __syncthreads() calls nor
exits. In our example these threads are held at a barrier controlled by the thread of rank zero, which
is itself held at group A's __syncthreads().
• Lines 16–18: In line 17 we launch the deadlock kernel with user supplied values for the number
of thread blocks (blocks) and the number of threads per block (warps*32). The int parameters
gsync and dolock are passed as kernel arguments. (The value of the blocks parameter actually
makes no difference to whether the kernel deadlocks or not, but it is still included here as a user
adjustable parameter for consistency with our other examples.)
• Line 21: This is the declaration of the deadlock kernel which takes two int parameters gsync
and dolock. As explained above, the parameter gsync controls a 3-way thread divergence; the
threads with ranks in the range [0,gsync-1] execute lines 26–29 while those with ranks in range
[gsync,2*gsync-1] execute lines 30–32 and threads with id’s ≥2*gsync execute neither of
these branches but continue from line 25 straight to line 33.
• Lines 23–25: Here we start the kernel by declaring a shared int variable lock which is initialised
to zero by thread 0. Line 25 is the usual __syncthreads() necessary after shared memory has
been initialised by one or more threads. After line 25 all threads will see the set value of lock. This
__syncthreads() is executed by all threads and has no potential to cause deadlock.
This code will deadlock if called with default parameters because the threads in the third warp do
not call __syncthreads() and cannot exit before thread 0 sets lock to 1.
• Lines 26–29: The statements inside this if clause will be executed by threads in group A with ranks
less than the user supplied flag gsync. For example, this would be just the threads in warp 0 if
gsync=32 or the first half of warp 0 if gsync=16 or all of warp 0 and half of warp 1 if
gsync=48 and so on.
○ Line 27: Here we perform a __syncthreads() inside the if clause. If the size of the thread
block is greater than gsync this barrier is not reached by all threads in the block.
○ Line 28: Here thread zero sets the shared variable lock to 1, allowing all threads to eventually
progress beyond line 33 and exit.
• Lines 30–31: The __syncthreads() in line 31 is executed for threads in group B with
ranks in the range [gsync,2*gsync-1]. If this path is taken by some threads, then there is
thread divergence and a strict reading of the early documentation implies that deadlock will
occur. In practice, for the GTX 970, there is no deadlock if threads simply execute different
__syncthreads().
• Line 33: Threads in group C arrive here directly bypassing both __syncthreads(). Note this is a
while statement not an if statement on the test for the value of lock. That means that all threads
will idle here waiting for thread zero to execute line 28 and thread zero can only do this if the
__syncthreads() in line 27 succeeds; thus we have the potential for deadlock even for older
devices with CC<7.
Preventing threads from simply exiting early in divergent code is essential for demonstrating
deadlock on CC<7 devices. If the switch dolock in line 33 is set to zero bypassing the check on
lock, then all threads with rank ≥2*gsync would simply exit at this point and any pending
__syncthreads() would then notice these exits and allow the lower ranking threads to continue
past their barriers as the exited threads automatically satisfy sync operations in CUDA.
• Line 34: The thread of rank zero prints a message which will only be seen if the kernel terminates
normally.13
Some results from running Example 3.8 on two devices are shown in Table 3.7. Entries
where the two devices behave differently are shown in bold. Looking at the results in
Table 3.7 we see the following:
• Row 1: Here there are two warps and all threads in a given warp execute the same
__syncthreads() but the two warps execute different __syncthreads(). There are
no deadlocks in this case.
• Row 2: Here we have three warps; the first two behave exactly the same as in row 1, but all
the threads in the third warp (warp 2) bypass lines 27 and 31 going straight to the check in
line 33. In the case dolock=0 all these threads exit and, therefore, do not block the
pending __syncthreads(). However, in the case dolock=1 all the threads from the
third warp stall at line 33 and block the pending __syncthreads().
• Row 3: Here the lower ranked half of warp 0 is in group A and goes to line 27 and the
higher ranked half is in group B and goes to line 31. All threads from warp 1 go straight to
line 33. The GTX card behaves as it did for row 2 but the RTX card deadlocks even for
dolock=0. This is because for CC≥7 devices the hardware really checks that all non-
exited threads for a given warp call the same __syncthreads.
• Row 4: Here there are 3 warps and all threads are either in group A or group B and reach line
27 or line 31. But half the threads in the middle warp (warp 1) belong to group A and half to
group B. Here the value of dolock has no effect because no thread goes directly to line 33.
The GTX kernel runs in both cases but the RTX deadlocks in both cases. The deadlock on
the RTX is because threads from warp 1 are held at different __syncthreads().
• Row 5: This is similar to row 4 except that threads from the highest ranked warp go
straight to line 33 and for dolock=1 are blocked by pending __syncthreads(). Thus
now the CC<7.0 device also deadlocks when dolock=1.
After these experiments we conclude that the following statements appear to apply to
avoiding deadlock when using __syncthreads():
“For devices of CC≥7: for all warps in each thread-block, all non-exited threads in a
particular warp must execute the same __syncthreads() call, but different warps can
execute different __syncthreads() calls.”
“For devices of CC<7: for all warps in each thread-block, at least one thread from
each warp having non-exited threads must execute a __syncthreads() call.”
Notice that avoiding deadlock is a necessary but by no means a sufficient condition for
ensuring your code is correct. The best advice is to avoid using __syncthreads() in
situations of thread divergence. Pay particular care with function calls in divergent code – are
you sure those functions do not use __syncthreads()?
If all the above looks complicated, remember that __syncthreads() is fine in non-
divergent code. In divergent code, try to use only warp-level synchronisation, which is
much more flexible because the NVIDIA cooperative groups library includes tools to
manage intra-warp thread divergence in a simple way.
A final remark on Example 3.8 is to point out that using __syncthreads() in sections of
divergent code is a clear error! The whole purpose of calling __syncthreads() is to
synchronise all the non-exited threads in the thread block. A more realistic situation that really
can occur is where a device function is called in divergent code and that function calls
__syncthreads(). This is especially true in larger projects where you might not have access
to the source code of the device function being called. Ideally writers of such functions should use
a weaker version of __syncthreads() which just acts on the currently active threads – just
this facility is provided by the coalesced groups feature in the NVIDIA cooperative groups library.
• Lines 27 and 32: These each create a coalesced group object a which is local to the scope
of the enclosing conditional statement. This object has all the functionality of a 32-thread
tiled_partition representing the local warp but adds a hidden bitmask to all
member functions selecting just the active threads.
• Lines 28 and 33: Here we use the a.sync() member function to perform a
__syncwarp() call restricted to the subset of active threads.
Examples 3.8 and 3.9 simply demonstrate how to cause and avoid deadlock in thread
divergent code. A useful version of 3.9 would have added code between lines 27 and 28
and between lines 32 and 33 where the added code performs useful warp-level work using
the available threads. If such added code calls device functions, then the coalesced group
object a could be passed as a function argument or the function itself could create its
own version.
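To illustrate this last point, the fragment below is our own minimal sketch (not one of the book's examples; the names bcast_leader and demo are invented, and plain pointers are used instead of the cx wrapper types). A device function receives the coalesced group, uses its shfl and sync members on just the active threads, and the kernel passes the group in as an argument.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// A device function that receives a coalesced group as an argument and
// works only on its active threads: here the value held by the
// lowest-ranked active thread is broadcast to the rest of the subset.
__device__ float bcast_leader(cg::coalesced_group &a, float v)
{
  v = a.shfl(v, 0);  // rank 0 of the *active* subset, whatever its lane is
  a.sync();          // barrier restricted to the active subset
                     // (not strictly needed after a shuffle, shown for illustration)
  return v;
}

__global__ void demo(float *data, const int *flags, int n)
{
  int tid = blockIdx.x*blockDim.x + threadIdx.x;
  if(tid >= n || flags[tid] == 0) return;            // divergent early exit
  cg::coalesced_group a = cg::coalesced_threads();   // the surviving threads
  data[tid] = bcast_leader(a, data[tid]);            // group passed as argument
  // alternatively bcast_leader could call cg::coalesced_threads() itself
}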
Example 3.10 reduce7_vl_coal kernel which uses subsets of threads in each warp
14 auto g = cg::this_grid();
15 auto b = cg::this_thread_block();
16 auto a = cg::coalesced_threads(); // active threads in warp
26 if(a.thread_rank() == 0)
atomicAdd(&sums[b.group_index().x],v);
27 }
. . .
40 __global__ void reduce_warp_even_odd
(r_Ptr<float>sumeven, r_Ptr<float>sumodd,
cr_Ptr<float>data, int n)
41 {
42 // divergent code here
43 if (threadIdx.x%2==0) reduce_coal_vl(sumeven,data, n);
44 else reduce_coal_vl(sumodd, data, n);
45 }
• Lines 10–27: The bulk of the device function reduce_coal_vl is identical to that of the kernel
reduce_warp_vl in Example 3.7. The only difference is that we use a coalesced_group
object a instead of a tiled_partition object w in lines 15 and 20–26. Since line 10 is
unchanged, only a subset of the values in data will be summed; this is because threads which have
diverged from the active coalesced group do not participate.
• Lines 16–21: Here each active thread sums a subset of values in data using a stride equal to the grid
size and starting with the true rank of the thread in its thread block. The only difference is the use of
a.sync() in line 21 to synchronise just the subset of active threads in a warp.
• Lines 21–25: These replace the 5-step tree reduction of lines 22–26 of Example 3.7. The if clauses
based on the value of a.size() do not cause any thread divergence but generalise the code to
work for any size of coalesced group which is a power of 2. If we just wanted the function to work
for the single value of 16, we could remove line 21 and use the if clauses in lines 22–25. Note the
code will fail if a.size() is not a power of 2 because then some of the shfl_down() calls will
have out-of-range offsets and hence return the calling thread’s value of v.
• Line 26: Here the lowest ranking thread within a is used to add the final group sum to the
appropriate element of sums.
• Lines 40–45: This is the kernel function called by the host to perform the calculation. The kernel
arguments are two arrays sumeven and sumodd which are used to hold separate block sums from
even and odd threads.
• Lines 43–44: Here reduce_coal_vl is called for all even ranking threads in the thread blocks in
line 43 and a separate call is made in line 44 for all odd threads. Thus, there is intra-warp thread
divergence and hence potential for deadlock. On devices with CC<7 for each warp the two function
calls will execute sequentially on the GPU because of the thread divergence. On devices with CC≥7
there may be some overlap of execution between these calls.
In testing we find that indeed this kernel does not deadlock and gives correct results. The
kernel runs about 1.8 times slower than Example 3.7 using a Turing RTX 2070 GPU. This
suggests that for this CC 7.5 card there may be a modest overlap of calculation between the
divergent threads of a warp.
We note that line 18 of Example 3.10, where sub-sums of the values in data are
accumulated by individual threads in the grid, is unmodified from the previous example
where all threads in a warp participated. Thus, values in data that “belong” to diverged
threads are simply not counted and the final values stored in sums correspond to just the
elements of data which belong to the active threads. This is the behaviour we wanted for this
artificial example where even and odd threads accumulate separate sub-sums which can be
combined to give the correct total at the end of the calculation.
As a final example in this section, Example 3.11 shows a revised version of 3.10 which is able
to calculate a complete sum with any size of coalesced group, providing that each warp has
at least one active thread. In this version it is not necessary for each warp to have the same
number of active threads.
• Lines 24–28: This is the summation of the elements of just the current sub-block using the active
threads in the warp. This code is equivalent to lines 17–20 of the previous examples which summed
over all the elements of data. Notice we use a.sync() in line 28.
• Lines 31–36: These lines deal with the difficulty that the number of active threads a.size() is
probably not a power of 2. The idea is to find kstart, the greatest power of 2 less than or equal to
a.size(). Then the v values held by threads with rank ≥ kstart are added to the v
values of the lowest ranking threads. After this has been done we can use a simple power of
2 reduction starting with kstart/2 (a standalone sketch of this step is given after this list).
○ Line 31: Here we calculate kstart using the CUDA intrinsic function __clz() (count leading
zeros) which returns the number of leading zero bits in its integer input argument.
○ Line 33: Here we set a temporary variable w to the value of v held by the thread of rank
kstart higher than the current thread. This will only be valid for threads of rank less than
a.size()-kstart.
○ Line 34: Adds w to v for low-ranking threads where the value of w is valid.
• Line 38: The remainder of the power of 2 reduction is now performed using a for loop. Explicitly
unrolling this loop, as was done in previous examples, is not possible here as the value of kstart
is not known at compile time.
• Line 39: This use of atomicAdd is the same as for previous examples.
• Lines 50–53: A simple demonstration driver kernel for reduce_coal_any_vl. In this case we
launch the kernel using every third thread. This means that some warps use 10 threads and some
use 11.
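The kstart trick is easier to see in isolation. The fragment below is our own sketch, not the book's Example 3.11: it combines the fold described in lines 31–36 with the subsequent power-of-2 reduction, leaving the group sum in the thread of rank 0.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Fold the "excess" high-ranked threads of a coalesced group onto the
// lowest ranks so that a standard power-of-2 shfl_down reduction can follow.
// a is the active coalesced group, v the calling thread's partial sum;
// the full sum over all active threads ends up in the thread of rank 0.
__device__ float reduce_any_size(cg::coalesced_group &a, float v)
{
  int size   = (int)a.size();
  int kstart = 1 << (31 - __clz(size));      // greatest power of 2 <= size
  float w = a.shfl_down(v, kstart);          // value held kstart ranks higher
  if((int)a.thread_rank() < size - kstart) v += w;  // add only where w is valid
  for(int k = kstart/2; k > 0; k /= 2)       // power-of-2 tree reduction
    v += a.shfl_down(v, k);
  return v;
}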
Either
auto mg = this_multi_grid();
or
multi_grid_group mg = this_multi_grid();
Synchronization across multiple GPUs is possible with
mg.sync();
The SDK 11 release brought in a few new features in addition to the reduce function
mentioned previously; we mention them here for the sake of completeness but interested
readers are referred to the NVIDIA documentation and examples for more details.
1. A new type of coalesced sub-group is introduced, the labeled_partition:
cg::coalesced_group a = cg::coalesced_threads();
cg::coalesced_group lg = cg::labeled_partition(a,label);
or
auto a = cg::coalesced_threads();
auto lg = cg::labeled_partition(a,label);
where label is an integer that may vary between threads. A separate coalesced group is created
for each value of label (a small sketch of one possible use is given after this list).
2. A new mechanism for asynchronously copying data from global memory directly into shared
memory; the copy is hardware accelerated and bypasses the L2 and L1 caches. This feature is intended to improve
the flow of matrix data through tensor cores as discussed in Chapter 11 but interestingly
it also has the potential to boost the performance of our reduce kernels which use
shared memory. More information can be found in the recent blog post at:
https://developer.nvidia.com/blog/cuda-11-features-revealed/. An example can be
found in the new SDK 11 globalToShmemAsyncCopy example in the
0_Simple directory.
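As a small illustration of labeled_partition, the sketch below is our own fragment (not one of the NVIDIA examples; it requires CC ≥ 7.0 and SDK 11, and assumes non-negative keys). Threads are grouped by a computed key, here a value modulo 4, and each resulting group can then be treated as an independent coalesced group.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void bucket_counts(int *counts, const int *keys, int n)
{
  int tid = blockIdx.x*blockDim.x + threadIdx.x;
  if(tid >= n) return;
  cg::coalesced_group a = cg::coalesced_threads();
  int label = keys[tid] % 4;                           // per-thread label
  cg::coalesced_group lg = cg::labeled_partition(a, label);
  // one atomicAdd per (warp, label) pair instead of one per thread
  if(lg.thread_rank() == 0) atomicAdd(&counts[label], (int)lg.size());
}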
This ends our chapter on CUDA cooperative groups; the main message is that they give
strong support to warp-level kernel programming and make explicit and safe some of the
tricks that were used for implicit warp-level programming prior to the arrival of CC = 7.0
and CUDA SDK release 9.0. More recent SDKs add additional tools showing NVIDIA’s
interest in this area.
The next chapter moves on to another topic favoured in many tutorials – parallel stencil
calculations for solving partial differential equations and image processing. We think you
will find these applications interesting.
Endnotes Chapter 3
1 The first-generation Tesla cards featured 16-thread half-warps.
2 NVIDIA does not seem to have given these subunits a specific name, but they are clearly indicated in the
relevant diagrams in the documentation, for example Figure 7 in the Pascal Architecture white paper.
I like to think of these units as “warp-engines”.
3 It should still work on older GPUs but might fail on Volta and later GPUs. Because possible failure
depends on a race condition this is potentially a nightmare debugging scenario. We very strongly advise
you to upgrade existing code to use the methods described later in this chapter.
4 Unfortunately, we cannot use the built-in parameter warpSize for this purpose because it is not defined
as const in the CUDA header files.
5 In fact, if you inspect the header file cooperative_groups.h you will see that many of the member
functions are no more than wrappers for the CUDA built-in variables. However, using the cooperative
groups leads to clearer, more portable code than using the built-in variables directly.
6 This nice simplification has only been possible in general kernel code since the release of CUDA SDK
11.0.
7 Indeed, since first writing this sentence NVIDIA has introduced a set of “Warp Matrix Functions” to
support the powerful new tensor core hardware in CC=7 and above GPUs. See Chapter 11 for
more details.
8 For devices of CC 8 and above there are also new intrinsic functions that perform the same operations.
9 CC≥7 architectures do support some intra-warp thread divergence and we might expect such support to
improve in future architectures.
10 This can be seen by inspecting the CUDA cooperative_groups.h include file.
11 The GPU hardware maintains a bitmask showing exited threads which allows the hardware to ignore
exited threads without additional overheads. (Although of course, the potential work that could have
been done by these threads had they not exited is lost.)
12 The return value of zero from inactive and exited threads was found by experiment; the behaviour in
these cases is documented as undefined. Therefore, it would be best not to rely on zero values in future.
13 This is a feature of the operating system (at least on Windows): the results of in-kernel printfs are
buffered by CUDA and are displayed after a kernel’s normal exit. If a kernel deadlocks this does
not happen.
14 “Currently active” means active at the time the object a is instantiated, for most code that should be good
enough. If you have really tricky real-time code, perhaps with program interrupts, then maybe you need
to take care – but such situations are beyond the scope of this book.
15 We could have used the old built-in variables such as gridDim and blockDim instead of using CG
objects but that really amounts to the same thing and we think it better style not to mix the two notations.
4
Parallel Stencils
In this chapter we discuss the use of 2D and 3D stencils for the iterative solution of partial
differential equations and for image processing. These are another classic application
of GPUs. We start with the 2D case and then move on to 3D.
4.1 2D Stencils
In numerical computing, problems such as the solution of partial differential equations can
often be approximately solved by replacing the continuous functions involved with discrete
approximations using function values evaluated on a grid of equally spaced points.1 For
example, the function $f(x)$ can be replaced by the set of equally spaced points along the
x axis $\{\ldots, f_{-2}, f_{-1}, f_0, f_1, f_2, \ldots\}$ where $f_i = f(a + ih)$, $a$ is
some convenient origin, $h$ is the spacing between points, and the integer $i$ labels the
points. In this notation the derivatives of $f$ can be replaced by their finite difference
approximations:
$$\left.\frac{df}{dx}\right|_{x=a+jh} = \frac{f_{j+1}-f_{j-1}}{2h}, \quad\text{and}\quad \left.\frac{d^2 f}{dx^2}\right|_{x=a+jh} = \frac{f_{j+1}-2f_j+f_{j-1}}{h^2}. \tag{4.1}$$

Or more simply:

$$f_0' = \tfrac{1}{2}\,(f_1 - f_{-1}) \quad\text{and}\quad f_0'' = f_1 - 2f_0 + f_{-1} \tag{4.2}$$

for the derivatives at the origin in the case where we assume the grid spacing is unity (i.e.
h=1). In this case Eq. 4.2 can be represented by the stencils

$$\frac{df}{dx} \rightarrow \tfrac{1}{2}\,[-1,\; 0,\; 1] \quad\text{and}\quad \frac{d^2 f}{dx^2} \rightarrow [\,1,\; -2,\; 1\,], \tag{4.3}$$
which show the coefficients required to be applied with respect to a point at the centre of the
stencil to calculate the derivatives at that point. Similar stencils exist in 2 and 3 dimensions and
in particular the Laplacian operator is given by:
$$\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} \equiv \nabla^2 f \;\rightarrow\; \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \tag{4.4}$$

and

$$\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2} \equiv \nabla^2 f \;\rightarrow\; \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}_{z=-1} + \begin{bmatrix} 0 & 1 & 0 \\ 1 & -6 & 1 \\ 0 & 1 & 0 \end{bmatrix}_{z=0} + \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}_{z=+1}. \tag{4.5}$$
In 3D the stencil becomes a 3D cube as indicated by the 3 slices shown in Eq. 4.5. We can
now use these stencils to find steady state solutions of many standard partial differential
equations in physics, for example, Poisson’s equation in electrostatics in Eq. 4.6 or the
diffusion equation for heat transfer in Eq. 4.7.
$$\nabla^2 \phi = -\frac{\rho}{\varepsilon_0}, \tag{4.6}$$

$$\nabla^2 u = \frac{1}{D}\,\frac{\partial u}{\partial t}. \tag{4.7}$$
In Eq. 4.6 ϕ is the electrostatic potential due to a static electrical charge distribution ρ and in
Eq. 4.7 u is the temperature at time t due to a static distribution of heat sources. To solve such
equations, we need to apply suitable boundary conditions. For our purposes it is simplest to
use Dirichlet boundary conditions where fixed values for ϕ or u are specified on the
boundaries of the grids used for the numerical solution. We will seek steady state solutions
for Eq. 4.7 where the right-hand side becomes zero. These solutions apply equally to
Poisson’s equation for the potential inside a closed region with boundaries at
fixed potentials.
Our first example uses a 2D square grid with points on the boundaries set to fixed values
and where the interior contains no additional charge or heat sources, thus inside the grid we
need to solve $\nabla^2 u = 0$ leading to
$$\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} u = 0 \;\rightarrow\; u_{-1,0} + u_{1,0} + u_{0,-1} + u_{0,1} - 4u_{0,0} = 0, \qquad u_{0,0} = \frac{u_{-1,0} + u_{1,0} + u_{0,-1} + u_{0,1}}{4}. \tag{4.8}$$
We can attempt to solve Eq. 4.8 for any given boundary conditions by setting the interior of
our grid to zero and then iterating, replacing each value inside the boundary by the average
of the 4 surrounding values. The boundaries themselves are left untouched. This process,
known as Jacobi iteration, with suitable boundary conditions turns out to be stable and will
converge, albeit slowly, to the correct solution. Example 4.1 shows our code to perform
Jacobi iteration inside a rectangle.
16 b[idx(y,x)] = 0.25f*(a[idx(y,x+1)] +
a[idx(y,x-1)] + a[idx(y+1,x)] + a[idx(y-1,x)]);
17 }
37 thrustHvec<float> a(size);
38 thrustHvec<float> b(size);
39 thrustDvec<float> dev_a(size);
40 thrustDvec<float> dev_b(size);
48 cx::timer tim;
49 for(int k=0;k<iter_host/2;k++){ // ping pong buffers
50 stencil2D_host(a.data(),b.data(),nx,ny); // a=>b
51 stencil2D_host(b.data(),a.data(),nx,ny); // b=>a
52 }
53 double t1 = tim.lap_ms();
54 double gflops_host = (double)(iter_host*4)*
(double)size/(t1*1000000);
• Line 10: This declares the stencil2D kernel with arguments being pointers to the input array a
and output array b and values for their common dimensions nx and ny.
• Line 12: Here we define the helper indexing function idx using a stride of nx.
• Lines 13–14: Find the array element to be processed by the present thread. It is assumed the kernel
launch will use a 2D thread grid of sufficient size to span the arrays.
• Line 15: This is an important check that the current thread is not out of range and also not on any of
the edges of the 2D array which are set to the fixed boundary conditions.
• Line 16: Here we implement Eq. 4.8 by summing the 4 elements of a surrounding the current point
and storing the average in b (a self-contained sketch of the whole kernel is given after this list).
• Lines 18–25: These lines are the equivalent host function stencil2D_host which performs the
same calculation using a double for loop over the elements of a.
The remaining lines show the host code needed to drive the calculation.
• Lines 32–36: Get values for the user supplied parameters; nx and ny are the array dimensions and
iter_host and iter_gpu are the number of iterations to perform.
• Lines 37–40: Declare the a and b arrays on the host and device. Note that thrust will automatically
set these arrays to zeros.
• Lines 41–45: Initialise the first and last columns of a and b to the boundary value 1. The corners are
then set to 0.5 as they belong to both a vertical and horizontal side. Note the top and bottom rows of
a and b have been set to zero by thrust.
• Lines 46–47: Copy the host arrays to the device arrays.
• Lines 48–54: This is the timed section where the stencil2D_host function is called
iter_host times.
○ Lines 50–51: Here we call the stencil2D_host twice within the for loop; the first call
processes the values in a and stores the results in b and the second call does the opposite; thus
at the end of the loop the final result is always in a. Note there is a tacit assumption that
iter_host is an even number.
○ Line 54: Here we calculate the performance of the calculation in GFlops/sec noting that 4 floating
point operations are used in line 23. The value is stored in gflops_host.
• Lines 55–56: Set the kernel 2D launch configuration, using a thread block size of 16 × 16 and a grid
size sufficiently large to span arrays of size nx × ny.
• Lines 57–61: This is the timed section for the stencil2D kernel and is similar to the corresponding
host section. We use the same ping-pong technique with two calls to the kernel inside the for loop.
• Line 64: Copy the final GPU result back to the host vector a. In a real-world program further
calculation would be done at this point.
• Lines 65–66: Calculate the GPU performance gflops_gpu and the host to GPU speed-up.
• Lines 67–69: Print some results.
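For reference, here is a minimal self-contained sketch of a Jacobi-sweep kernel along the lines described in the bullets above; it is our own fragment rather than the book's listing, and it uses plain restricted pointers instead of the cx wrapper types (the kernel name stencil2D_sketch is invented).

// One Jacobi sweep: b <- average of the 4 neighbours in a (Eq. 4.8).
// Boundary elements are left untouched.
__global__ void stencil2D_sketch(const float * __restrict__ a,
                                 float * __restrict__ b, int nx, int ny)
{
  auto idx = [&nx](int y, int x){ return y*nx + x; };       // 2D addressing helper
  int x = blockIdx.x*blockDim.x + threadIdx.x;
  int y = blockIdx.y*blockDim.y + threadIdx.y;
  if(x < 1 || y < 1 || x >= nx-1 || y >= ny-1) return;      // skip edges and out-of-range
  b[idx(y,x)] = 0.25f*( a[idx(y,x+1)] + a[idx(y,x-1)]
                      + a[idx(y+1,x)] + a[idx(y-1,x)] );
}

Such a kernel would be launched with a 2D grid, for example dim3 threads(16,16) and dim3 blocks((nx+15)/16,(ny+15)/16), and called twice per loop pass with the roles of a and b swapped, exactly as in the host code above.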
single call from the host. This, however, is not possible for the kernel calls. For a typical launch
configuration using a thread block of size 16 × 16 and array sizes of 256 × 256, we need
256 thread blocks in the launch grid. These blocks can run in any order and, while CUDA provides
the __syncthreads() call to synchronise the threads within one block, it does not provide a
way of synchronising threads between thread blocks. Note the stencil algorithm used to update
a given tile requires not just the values in the tile itself but also the values in a one-element-wide
halo around the tile. Hence in the GPU code the only way to ensure all values are kept in step
is to wait for the entire kernel to finish one iteration before starting the next.2
To use our double buffering method with CUDA kernels we have to ensure that a kernel
call using the array a to update the array b has completely finished before the next kernel call
using the array b to update a. If it were not for the halo problem, each thread block could
have proceeded independently using __syncthreads() at the end of each of its passes.
As already mentioned, the amount of calculation done by the stencil functions shown in
Example 4.1(a) is small, there are only 4 useful floating-point operations (3 adds and one
multiply) in the inner loop for the host version or per thread for the kernel function. One way
to attempt to optimise this code is to notice that each element of the input array is read
4 times by different threads in the kernel or different passes through the for loops in the
host code. Using cache to exploit this should improve performance. Using shared memory is
an obvious way to proceed for this kernel and indeed stencil calculations are used as standard
examples of shared memory in many CUDA books and tutorials.
It is straightforward to use 2D thread blocks of, for example, size 16 × 16, and copy tiles
from the input array to shared memory as was done for the matrix multiplication example in
Chapter 3. However, there is a significant complication in that to perform a stencil calculation
for any array element we need to read values from its neighbours. This means that at the edges
of the tile we need values not stored in the tile. A simple widely used solution is to use
overlapping tiles in shared memory where the tile edges are a “halo” region containing values
necessary for the calculation whose corresponding elements are updated by other overlapping
thread blocks. This means that for a 16 × 16 shared memory tile with a halo which is one
element wide, only the inner 14 × 14 values correspond to elements that will be processed by
the thread block, but these inner elements have access to all of their 8 nearest neighbours in
shared memory.3 A kernel based on this idea is shown in Example 4.2; note the tile size
(typically 16 × 16) is specified at compile time using the template parameters Nx and Ny; this
may allow the NVCC compiler to make additional optimisations during compilation.
// inside array ?
24 if(xa < 1 || ya < 1 || xa >= nx-1 || ya >= ny-1) return;
// inside tile ?
25 if(xs < 1 || ys < 1 || xs >= Nx-1 || ys >= Ny-1) return;
26 b[idx(ya,xa)] = 0.25f*(s[ys][xs+1] + s[ys][xs-1] +
s[ys+1][xs] + s[ys-1][xs] );
27 }
• Lines 15–20: This set of statements sets up the necessary variables for creating shared memory tiles
which map to overlapping regions of the arrays a and b for adjacent thread blocks.
○ Lines 15–16: Then we set the origin (x0,y0) of the tile for the current thread block in the arrays
a and b. For non-overlapping tiles we would simply use blockDim.x/y, but here we subtract
2 which is twice the halo width. Thus for overlapping 16 × 16 tiles the stride between adjacent
tiles is 14 in both x and y.
○ Lines 17–18: The current thread’s position (xa,ya) in the arrays a and b is set here.
○ Lines 19–20: The current thread’s position (xs,ys) in the shared memory tiles is set here; this is
Contrary to our expectations, the shared memory kernel of Example 4.2 is about 20% slower than the
previous Example 4.1 version which relied on implicit GPU memory caching. The reason for this is
that the new kernel only reuses items in shared memory four times and the savings compared
to cached accesses to global memory are too small to compensate for the time required to set
up the shared memory tile and halo. Note early generations of NVIDIA GPUs had poorer
GPU memory caching and so use of shared memory for stencil calculation was
more advantageous.
If we generalise to a 9-point stencil, as shown in Example 4.3, the shared memory version is
only about 8% slower for large array sizes. Example 4.3
implements the general 9-point stencil shown in Eq. 4.9:
$$u_{ij} = \sum_{p=-1}^{1}\;\sum_{q=-1}^{1} c_{pq}\, u_{i+p,\,j+q}, \tag{4.9}$$
where the coefficients c sum to unity. In the kernel code the coefficients are passed from the
host as a fifth array argument.
13 int x = blockIdx.x*blockDim.x+threadIdx.x;
14 int y = blockIdx.y*blockDim.y+threadIdx.y;
15 if(x<1 || y <1 || x >= nx-1 || y >= ny-1) return;
16 b[idx(y,x)] =
c[0]*a[idx(y-1,x-1)] + c[1]*a[idx(y-1, x)] +
c[2]*a[idx(y-1,x+1)] + c[3]*a[idx(y ,x-1)] +
c[4]*a[idx(y , x)] + c[5]*a[idx(y ,x+1)] +
c[6]*a[idx(y+1,x-1)] + c[7]*a[idx(y+1,x )] +
c[8]*a[idx(y+1,x+1)];
17 }
The use of the idx index function in line 16 of stencil9PT significantly contributes to
the clarity of the code. Only lines 10 and 16 differ from stencil2D. The corresponding
modifications to Example 4.2 (not shown) are similar.
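By way of illustration, the host might fill the coefficient array c with a simple 3 × 3 smoothing filter whose nine coefficients sum to unity before launching stencil9PT. The fragment below is our own sketch, not part of the book's code base; it assumes the thrustHvec/thrustDvec aliases, the blocks/threads launch configuration and the argument order (a, b, nx, ny, c) used in the earlier examples.

// Host-side setup of the nine stencil coefficients (they sum to unity).
thrustHvec<float> c(9);
float w[9] = {1.f,2.f,1.f, 2.f,4.f,2.f, 1.f,2.f,1.f};
for(int k=0; k<9; k++) c[k] = w[k]/16.0f;     // 3 x 3 smoothing filter
thrustDvec<float> dev_c = c;                  // copy coefficients to the device
stencil9PT<<<blocks,threads>>>(dev_a.data().get(), dev_b.data().get(),
                               nx, ny, dev_c.data().get());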
The performance of the stencil codes for various 2D problem sizes is shown in Figure 4.2.
The grey lines correspond to the kernels and host code shown in Examples 4.1 and 4.2 using
4-point stencils and the black lines are for the equivalent 9-point stencils as shown in
Example 4.3 for the GPU. The GPU version of the 9-point stencils delivers about three times
more GFlops/sec than the 4-point equivalents. This is because each thread in a 9-point stencil
performs 17 floating point operations compared to four in the 4-point versions and, as the kernel
is memory bound, the extra computation is “free”. It can be seen that using shared memory
never helps on our RTX 2070 although for small array sizes, where kernel performance is
poor, the two methods converge.
For the 2D stencil codes considered here the additional complexity introduced by using
shared memory is not justified. On earlier GPUs, memory caching was less efficient and so
the benefits of using shared memory were much greater and this is reflected in the emphasis
placed on shared memory codes in many available tutorials. However, if you are developing
new code for modern GPUs, I recommend starting with a simpler version not using shared
memory and once that is working consider trying shared memory. The number of times each
item in shared memory is used will determine the likely performance gains. Example 4.2 is
still a good model for implementing tiled code with halos in suitable problems.
The first 5 images show the results after 3 × 10³, 10⁴, 5 × 10⁴, 10⁵ and 4 × 10⁵ iterations.
The sixth image shows equal value contours on the converged solution shown in the fifth
image. The figure on the right shows horizontal profiles across the centres of the five images.
This shows that the slow convergence of the Jacobi iteration process is due to the slow rate of
propagation of the fixed boundary values inwards to the centre of the array which was
initialised to zeros.
A more general convergence test for this problem is to compare the a and b arrays as the
iterations proceed and stop once the differences become small enough. In practice, if we
keep iterating we find that either a and b become identical or the iterations become trapped
in a limit cycle where a becomes b and vice versa on each iteration. In either case this is the
best we can do, and we can implement a convergence test based on the largest
absolute difference between any pair of corresponding a and b values. The code for this is
shown in Example 4.4.
15 __shared__ T s[256];
16 int id = block.thread_rank();
17 s[id] = (T)0;
22 block.sync();
23 if(id < 128) s[id] = fmaxf(s[id],s[id + 128]);
block.sync();
24 if(id < 64) s[id] = fmaxf(s[id],s[id + 64]);
block.sync();
25 if(warp.meta_group_rank()==0) {
26 s[id] = fmaxf(s[id],s[id + 32]); warp.sync();
27 s[id] = fmaxf(s[id],warp.shfl_down(s[id],8));
29 s[id] = fmaxf(s[id],warp.shfl_down(s[id],4));
30 s[id] = fmaxf(s[id],warp.shfl_down(s[id],2));
31 s[id] = fmaxf(s[id],warp.shfl_down(s[id],1));
// store block max difference
32 if(id == 0) smax[blockIdx.x] = s[0];
33 }
34 }
• Line 10: The declaration is similar to that of reduce6 except that we have added a second input
array b to the argument list and the data type of the array arguments is now the template parameter T
rather than float. The parameter T also replaces explicit float elsewhere in the code.
• Lines 11–17: These are identical to reduce7.
• Lines 18–21: This is the main data reading loop where threads accumulate their contributions to the
subsequent reduction step. The pointer to the second data array b is used as a flag here to decide if
this is an initial call to this function, in which case the maximum of the current value of a[tid] is
stored in s[id] or if b is not equal to nullptr the absolute value of the difference between
a[tid] and b[tid] is stored in s[id].
• Lines 22–32: These lines are the same as lines 26–38 of reduce6 except that we have replaced the
+= operators with =fmaxf(...) in order to find a maximum value rather than perform a summation.
• Line 35: The array_diff_max function also has its array arguments templated.
• Lines 37–38: Create two local arrays to hold the set of block-wise maxima from the first pass and the
global maximum from the second pass.
• Lines 39–40: Call the array_diff_max kernel twice after which the global maximum is stored in
d[0].
• Line 42: Return the result.
A call to array_diff_max can now be easily incorporated into Example 4.1 as shown
in Example 4.5
. . .
58 for(int k=0;k<iter_gpu/2;k++){ // ping pong buffers
59 stencil2D<<<blocks,threads>>>(dev_a.data().get(),
dev_b.data().get(), nx, ny); // a=>b
60 stencil2D<<<blocks,threads>>>(dev_b.data().get(),
dev_a.data().get(), nx, ny); // b=>a
60.1 if(k>0 && k%5000==0){ // check once in 5000 iterations
60.2 cudaDeviceSynchronize();
60.3 float diff = array_diff_max<float>
(dev_a.data().get(),dev_b.data().get(),nx,ny);
60.4 if(diff<1.0e-16){
60.5 printf("converged k=%d diff=%10.3e\n", k*2, diff);
60.6 break;
60.7 }
}
61 }
62 cudaDeviceSynchronize();
. . .
Example 4.5 shows a modification to the Example 4.1 host code to include a convergence
check. In this version the user supplied parameter iter_gpu becomes the maximum number
of iterations performed. Line 60.4 shows a convergence test using array_diff_max with a
cut-off value specified as 10⁻¹⁶. In practice, the latter value could be user specified for tuning
tests. Note line 60.1 causes the convergence check to be made once every 5000 passes through
the loop, making its effect on performance small.
There is, in fact, an analytic solution to the equation $\nabla^2 \phi = 0$ with our boundary conditions; it
takes the form of a Fourier series in the y coordinate and is given by Eq. 4.10,

$$\phi(x,y) = \sum_{n=1,3,5,\ldots}^{\infty} \frac{4}{n\pi}\,\sin(n\pi y)\,\frac{\sinh(n\pi x) + \sinh\big(n\pi(1-x)\big)}{\sinh(n\pi)}, \tag{4.10}$$
where the summation is over all positive odd integers. We can use this formula to test the
accuracy of our stencil calculation. Since we are interested in high performance we will
discuss results for a large square of size 1024 × 1024. In Table 4.2 we show results for both
the 4-byte float version of stencil2D as shown in Example 4.1 and an 8-byte double version.
The fourth column of Table 4.2 shows the maximum difference between corresponding values
of the a and b arrays after the indicated number of iterations; a value of zero in this column
indicates exact convergence. The fifth column of the table shows the average absolute differ-
ence between the final values in a and reference values calculated using Eq. 4.10. These
averages are calculated using a central square of size 960 × 960 as the Fourier series in Eq. 4.10
converges very slowly for values of y near to zero and one. We can see that the best accuracy is
(not surprisingly) achieved at convergence and is about 2 × 10⁻³ using floats and 4 × 10⁻⁷
using doubles. However, the number of iterations required to reach convergence is
large and so are the times required: 33 seconds using floats and 260 seconds using doubles. One
interesting feature is that the performance in GFlops/sec achieved using doubles is about half of
that achieved using floats. This is an unexpectedly good result for the RTX 2070 GPU used
here because this gaming card has only one 64-bit ALU for every thirty-two 32-bit ALUs in the hardware.
The fact that stencil calculations are memory bound, not compute bound, explains this result.
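For readers who want to reproduce the comparison, a minimal host-side sketch for evaluating Eq. 4.10 is shown below. This is our own code, not the book's; the function name phi_exact and the truncation point nmax are arbitrary choices, and nmax must stay modest (n·π below about 700) or sinh(nπ) overflows in double precision.

#include <cmath>

// Analytic solution of Eq. 4.10 at (x,y) with x,y in [0,1],
// truncated after the odd terms up to nmax.
double phi_exact(double x, double y, int nmax = 101)
{
  const double pi = 3.14159265358979323846;
  double sum = 0.0;
  for(int n = 1; n <= nmax; n += 2){                         // odd n only
    double ratio = (std::sinh(n*pi*x) + std::sinh(n*pi*(1.0-x)))
                   / std::sinh(n*pi);
    sum += 4.0/(n*pi) * std::sin(n*pi*y) * ratio;
  }
  return sum;
}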
coefficients to add up to unity, otherwise the total sum of the array contents would
monotonically increase or decrease. If we initialise the inside of the array to zero, then only
non-zero boundary values can produce change. These boundary values gradually diffuse into
the array travelling inward by at most one element per iteration. At the same time the
incorrect starting values diffuse out towards the boundaries, where they are overwritten by
the fixed boundary values. Eventually only correct values remain inside the grid and
convergence has been reached. Clearly the convergence would be quicker if the grid were
initialised with better approximations to the correct answer. Approximate starting values are
easily found by solving a smaller grid first and using that result to populate starting values for
a bigger grid. We have implemented this method doubling the size of the array dimensions at
each step. We refer to this as cascade iterations and our code is shown in Example 4.6.
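Before looking at the fragments of Example 4.6 below, here is a minimal sketch of the upscaling idea; it is our own simplified kernel (named upscale2x), not the book's zoomfrom, and it omits the boundary-value handling that zoomfrom also performs. Each element of the new, twice-as-large grid simply takes the value of the nearest element of the old grid.

// Nearest-neighbour 2x upscale of a square grid: aold is mx x mx,
// a is nx x nx with nx = 2*mx.
__global__ void upscale2x(float * __restrict__ a,
                          const float * __restrict__ aold, int nx)
{
  int mx = nx/2;                                   // old grid dimension
  int x = blockIdx.x*blockDim.x + threadIdx.x;
  int y = blockIdx.y*blockDim.y + threadIdx.y;
  if(x >= nx || y >= nx) return;
  int xs = min(x/2, mx-1);                         // nearest old-grid point
  int ys = min(y/2, mx-1);                         // (min guards odd nx)
  a[y*nx + x] = aold[ys*mx + xs];                  // copy the value up
}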
14 int mx = nx/2;
15 auto idx = [&nx](int y,int x){ return y*nx+x; };
16 auto mdx = [&mx](int y,int x){ return y*mx+x; };
36 if(type==1)cascade<double>(ny,nx,iter_gpu); // double
37 else cascade<float> (ny,nx,iter_gpu); // float
38 std::atexit([]{cudaDeviceReset();}); // safe reset
39 return 0;
40 }
50 template <typename T> int cascade(int nx, int ny, int iter)
51 {
52 int nx_start = std::min(nx,32);
53 int size = nx_start*nx_start; // square array only!
54 thrustDvec<T> dev_a(size); // initial buffers
55 thrustDvec<T> dev_b(size);
56 thrustDvec<T> dev_aold(size);
67 zoomfrom<T><<<blocks,threads>>>
84 thrustHvec<T> a(nx*nx);
85 a = dev_a;
86 char name[256]; sprintf(name,"cascade%d_%d.raw",
nx,(int)sizeof(T));
87 cx::write_raw(name,a.data(),nx*nx); // square
88 printf("cascade time %.3f ms\n",t1);
89 return 0;
90 }
• Line 10: The kernel arguments are the arrays a and b of dimensions nx × ny which are to be
filled with the upscaled contents of aold. The array aold is assumed to have dimensions of
nx/2 × ny/2.
• Lines 12–14: The kernel is designed to be called with a 2D thread grid of size at least nx × ny.
Here we find the current thread’s x and y in the standard way.
• Lines 14–16: Here we define lambda functions for 2D addressing of the arrays; idx uses a stride of
nx between rows and mdx uses a stride mx set to nx/2.
• Lines 17–27: This set of if statements causes each thread to either copy a value from aold to both
a and b or to copy the appropriate boundary value instead. Note there is a lot of thread divergence in
this kernel but performance is not an issue as this kernel is typically only called once every few
thousand iterations. This kernel has been written in a way which makes our intentions clear. Note the
upscaling algorithm in line 17 is crude. We did experiment with smoother interpolations but found
these did not improve performance.4
• Lines 30–40: This is the main routine:
○ Line 32: Set the flag type which is used in lines 36–37 to call the template function cascade as
either single precision or double precision. This template choice then propagates down to all
subsequent templated functions and kernels. This structure demonstrates how to effectively
template an entire program.
○ Lines 33–35: Set the final array size and the maximum number of iterations per size step. Note that
nx is assumed to be a power of two and that the array is a square of size nx × nx. The
y-dimension ny is not used here but is intended for a future generalisation of this code.
In a production version of this code additional parameters would be settable by the user.
The convergence parameters check and diff_cut which are set in lines 68–69 of the
cascade function are prime candidates for performance tuning experiments.
○ Lines 36–37: Call the cascade function with the desired precision.
○ Line 38: Our standard thrust safe CUDA exit for profilers.
Lines 50–90: This is the cascade function which does all the work; in a non-templated version it
could have been part of the main program.
• Line 50: Here the function arguments are the final array dimensions nx and ny and the required
number of iterations iter. Note ny is not actually used in this version, which tacitly assumes
square arrays.
• Lines 52–53: Here we set nx_start to the value 32 which is the starting array dimension.
Pedantically we allow for the case of small starting array sizes by using min, but this would not
be a sensible choice. The starting array is a square array of dimension nx_start.
• Lines 54–56: Here we declare three working device arrays, dev_a and dev_b which are used as
before for the flip-flop stencil2D iterations and dev_aold which holds the final result from the
previous pass through the main loop in lines 58–82.
• Lines 58–59: This is the for loop over the array sizes; mx holds the current array dimension, which
starts at 32 and doubles on each pass up to the final size nx. In line 59 my is set equal to mx.
• Lines 60–61: Set the usual launch parameters blocks and threads to span the current array size
using 2D tiles of size 16 × 16.
• Lines 63–66: Here we resize dev_a and dev_b to fit the current array sizes. This step is not needed
on the first pass through the loop. The resize member functions of C++ container classes are not
particularly efficient but their use is strongly recommended for cases like this – it keeps your code
simple and makes your intention clear. Since most of the time will be taken iterating with
stencil2D, efficiency elsewhere in this loop is not critical.
• Line 67: The kernel zoomfrom is called here with the three working arrays as arguments. After this
call the arrays dev_a and dev_b are initialised with the required boundary conditions on their edges
and a scaled-up version of the previous iteration, stored in dev_aold, in their interiors. On the first
pass through the loop all three arrays are set to zero, so the effect of the call here is to set the correct
boundary conditions in a and b while leaving their interiors set to zero. On subsequent passes
dev_aold will not yet have been resized from the previous pass and its interior values will be
upscaled into a and b.
Table 4.3 Results from the cascade method using 4-byte floats and arrays of size 1024 × 1024
• Lines 68–69: Here we set the parameters check and diff_max which control the frequency of
convergence checks during the stencil iterations and the allowed difference between a and b values
used by the convergence test.
• Lines 70–78: This is the iteration loop with pairs of flip-flop calls to stencil2D in lines 71 and 72;
after each pair of calls the most recent result is in dev_a.
• Lines 80–81: Here we prepare for the next upscale pass by resizing dev_aold to match the current
size of dev_a and then copying dev_a to dev_aold.
• Lines 84–88: Here we finish up by copying the final result back to the host, writing it to disk and
printing timing information.
Some results obtained with the cascade version of stencil2D are shown in Table 4.3.
The cascade method is more accurate and faster than simply using all zeros as starting
values. In the case of floats, we can achieve an accuracy of ~2 × 10⁻³ in 1000 iterations
instead of the 1,000,000 required before. The corresponding speed-up is a factor of 500. For the
more accurate calculations using doubles the performance gains are more modest; we can
achieve a final accuracy of ~3.6 × 10⁻⁷ in 300,000 iterations compared with over 5,000,000
required before. This is a speed-up of about 16. The rapid convergence of the cascade
approach is particularly helpful for 3D stencil calculations which we discuss next.
4.3 3D Stencils
One really important benefit from the increased processing power of GPUs is the ability to
tackle 3D versions of problems which are already demanding as 2D problems on single processors.
Problems such as stencil calculations are a great example of this. A 2D kernel will typically
be written so that each x-y element (grid point or image pixel etc.) is processed by a separate
thread. When converting a CUDA kernel from a 2D to a 3D version we have two options:
1. Add a for loop over the z dimension to your existing kernel so that one thread processes
all points with a fixed x and y stepping through the full z range.
2. Add more threads to the kernel launch so that each point in the 3D grid is processed by a
different thread.
Example 4.7 shows two versions of a stencil3D kernel using these two approaches.
14 int x = blockIdx.x*blockDim.x+threadIdx.x;
15 int y = blockIdx.y*blockDim.y+threadIdx.y;
16 if(x < 1 || y < 1 || x >= nx-1 || y >= ny-1) return;
23 int x = blockIdx.x*blockDim.x+threadIdx.x;
24 int y = blockIdx.y*blockDim.y+threadIdx.y;
25 int z = blockIdx.z*blockDim.z+threadIdx.z;
26 if(x<1 || y<1 || x>= nx-1 || y>= ny-1 || z<1
|| z>= nz-1) return;
27 b[idx(z,y,x)] = (T)(1.0/6.0)*(
a[idx(z,y,x+1)] + a[idx(z,y,x-1)] +
a[idx(z,y+1,x)] + a[idx(z,y-1,x)] +
a[idx(z+1,y,x)] + a[idx(z-1,y,x)] );
28 }
(The times shown in Table 4.4 are for 10,000 iterations using a 6-point stencil.)
• Line 13: Here we define a 3D version of our idx lambda function used to address the 3D arrays a
and b.
• Line 17: Here we add a for loop over the third array index z; note we exclude z=0 and z=nz-1
from the loop because only interior points are changed by the kernel. This complements line
16 which does the same job for x and y. The single statement in the loop replaces interior elements
of b with the average of the six nearest neighbours of the corresponding element in a. (Note the 2D
version uses four nearest neighbours.)
The stencil3D_2 kernel uses method 2 above:
• Line 25: Here we set the z value for the current thread from the kernel launch configuration. For this
to work the launch configuration needs to be changed as follows:
○ 2D launch (assumes a for loop over z in 3D kernels)
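The book's listings of the two launch configurations are not reproduced at this point; the fragment below is our own sketch of what they might look like, assuming float arrays, the argument order (a, b, nx, ny, nz) and arbitrarily chosen block sizes of 16 × 16 and 8 × 8 × 8.

// (a) 2D launch for stencil3D_1, which loops over z inside the kernel
dim3 threads2d(16, 16, 1);
dim3 blocks2d((nx+15)/16, (ny+15)/16, 1);
stencil3D_1<<<blocks2d, threads2d>>>(dev_a, dev_b, nx, ny, nz);

// (b) 3D launch for stencil3D_2, one thread per grid point
dim3 threads3d(8, 8, 8);
dim3 blocks3d((nx+7)/8, (ny+7)/8, (nz+7)/8);
stencil3D_2<<<blocks3d, threads3d>>>(dev_a, dev_b, nx, ny, nz);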
Somewhat to our surprise, using the RTX 2070 GPU and an array size of 256 × 256 × 256
we find stencil3D_2 performs about 33% faster than stencil3D_1, which is a big
difference. Once again this demonstrates that modern GPUs really benefit from lots of threads
to hide memory latency in memory bound problems. Table 4.4 summarises these results.
We note that the effective memory bandwidths shown in the table exceed the device
hardware limits of around 410 GB/sec by up to a factor of three, demonstrating the effective
use of local caching.
The details of our 3D host code are not shown here but are available online. For the 3D
problem, the boundary conditions are specified on the faces of the 3D cube, in the code we
choose to set the faces in the y-z planes at x=0 and x=nx-1 to 1.0 and the remaining faces
to zero. The edges and corners of the volume are set to averages of the touching surfaces.
10 __global__ void filter9PT(cr_Ptr<uchar> a,
r_Ptr<uchar> b, int nx, int ny, cr_Ptr<float> c)
11 {
12 auto idx = [&nx](int y,int x){ return y*nx+x; };
13 int x = blockIdx.x*blockDim.x+threadIdx.x;
14 int y = blockIdx.y*blockDim.y+threadIdx.y;
15 if(x<0 || y <0 || x >= nx || y >= ny)return;
19 uint f = (uint)(v+0.5f);
20 b[idx(y,x)] = (uchar)min(255,max(0,f)); // b in [0,255]
21 }
• Line 10: The kernel arguments are the input image a and a buffer b to store the results of the filter;
these are passed as pointers to arrays of type uchar (unsigned char). The image dimensions
are passed as nx and ny and the nine filter coefficients are passed as a third array c of type float. In
C++ the type uchar behaves as an unsigned 8-bit integer and is commonly used for image
manipulation. Calculations involving 8-bit integers are often done using 16- or 32-bit variables to
hold intermediate results which are then either scaled to an 8-bit result or just copied back with a floor
and ceiling of 0 and 255 applied.
• Line 12: Defines idx to aid the addressing of 2D arrays.
• Lines 13–14: The kernel is designed for one thread to process one image pixel and hence requires a
2D launch configuration of sufficient size to span the input array. Then we find the x and y pixel
coordinates for the current thread.
• Line 15: A standard out-of-range check on the pixel coordinates. This kernel can deal with any size
of input image.
• Lines 16–17: Each thread needs to read the eight nearest neighbours of the pixel at (x, y), and we
must check that these are also in range. There are various ways of doing this and there is an
annoying complication that every thread has to make these checks even though the vast majority of
threads process pixels well inside the image boundaries. Here we use xl and yl to index the pixels
to the left and above (x,y) and xh and yh pixels to the right and below (x,y). We set these to safe
values using the fast CUDA max and min functions. This avoids using if statements which are
slower in kernel code. A side effect is that pixels on the edges of the image will have a slightly
different filter applied, with one or two duplicate image values being used to replace non-existent
pixels. Arguably this is as good as any other method of dealing with boundary effects.
• Line 18: This is the general 9-point filter. Note that since the coefficients are of type float, the pixel
values will be promoted from uchar to float during the calculation; this is an important detail.
• Line 19: The filter result v is rounded to the nearest uint and stored in f.
• Line 20: The result in f is converted to uchar with a floor of 0 and ceiling of 255 applied and then
stored in the output array b.
The filter9PT kernel is short and simple and delivers between 100 and 900 GFlops of
performance depending on image size. This is equivalent to more than 3000 frames per second
for a large image of size 4096 × 4096 and is more than enough for any real time display
applications which require at most 120 frames per second. However, many image processing
applications are for the analysis of scientific data and for these we may desire more performance
at the price of complicating our kernels. Our image filtering kernel is still limited by
memory access and can be improved. The first simple step is to notice that the coefficients
c[0]–c[8] used in line 18 are being read from global memory by each thread. A faster
solution is to store these parameters in GPU constant memory as shown in Example 4.9.
Example 4.9 filter9PT_2 kernel using GPU constant memory for filter coefficients
. . .
05 __constant__ float fc[9]; // declaration has file scope
. . .
10 __global__ void filter9PT_2(cr_Ptr<uchar> a,
r_Ptr<uchar> b, int nx, int ny)
11 {
. . .
18 float v = fc[0]*a[idx(yl,xl)] + fc[1]*a[idx(yl, x)] +
fc[2]*a[idx(yl,xh)] + fc[3]*a[idx(y ,xl)] +
fc[4]*a[idx( y, x)] + fc[5]*a[idx(y ,xh)] +
fc[6]*a[idx(yh,xl)] + fc[7]*a[idx(yh, x)] +
fc[8]*a[idx(yh,xh)];
. . .
21 }
. . .
// copy c on host to fc on device constant memory
45 cudaMemcpyToSymbol(fc, c.data(), 9*sizeof(float));
. . .
Example 4.9 is a minor modification to Example 4.8; in line 10 we have removed the final
argument containing the filter coefficients c and replaced them with an array fc in GPU
constant memory declared at file scope. Also, in line 18 we have replaced c with fc to pick
up the values in the global array. Line 45 is a line from the new main routine using
cudaMemcpyToSymbol to copy the coefficients from the host array c to the device global
array fc. Constant memory declared at file scope can also be initialised directly by
something like __constant__ float fc[9] = {0,-1,0, -1,0,1, 0,1,0}.
CUDA constant memory is a block of special GPU memory, currently of size 64 KB on all
NVIDIA GPUs, which is shared as read-only memory by all SM units. Constant memory has
separate caching and can be efficiently read by all threads in a warp. It is intended for
parameters in just the sort of case discussed here. Some constant memory is always reserved
by the system but 48 KB is available for use by kernel code. Actually, all kernel arguments
passed by value are stored in constant memory by the system so there is no need to explicitly
do this as an optimisation. You can also use a kernel argument as a variable in kernel code,
but if you do then that argument will be copied to a local register by the compiler and the
advantage of constant memory is lost. There is a limit of 4 KB on the total size of all
arguments passed to a kernel.
The effect of this small change is to speed up the kernel by up to ~25% for larger images.
Encouraged by this quick win, we can look at ways of reading the elements of the array
more efficiently.
In Examples 4.8 and 4.9 there are two issues with the way the elements of the input array
a are read. Firstly, they are read from global memory as single bytes and secondly each
element is read by nine different threads. We attack both issues by reading from global
memory using the 32-byte uchar4 type and then placing the values in shared memory. This
is a form of vector loading which has been discussed previously.
The kernel filter9PT_3 implements these ideas and is shown in Example 4.10. Our
new kernel is considerably more complicated, but it does give a further speed-up of about
30%. In this kernel each thread handles four elements of a stored in the consecutive bytes
of one 32-bit word. The kernel is specifically designed for 2D thread blocks of dimension
16 × 16 and shared memory is allocated as a 2D array of 66 × 18 bytes which is sufficient to
hold a central core of 64 × 16 bytes surrounded by a halo of single bytes. The central core
allows each thread to load 4 bytes as a single uchar4 and then process them as individual
bytes. The external halo is 1 byte deep which is sufficient for a 3 × 3 filter. Note that
explicitly loading the external halo used here is more complicated than implicitly loading an
internal halo as was done for stencil2D_sm in Example 4.2.
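Stripped of the halo logic, the basic vector load and store pattern looks something like the sketch below; this is our own kernel (copy_vec4), which simply copies an image, and it assumes the row length nx is a multiple of 4 so that every uchar4 access is 4-byte aligned.

// Copy an image using uchar4 vector loads and stores,
// one thread per group of 4 consecutive pixels.
__global__ void copy_vec4(const unsigned char * __restrict__ a,
                          unsigned char * __restrict__ b, int nx, int ny)
{
  int x4 = blockIdx.x*blockDim.x + threadIdx.x;      // index in units of 4 pixels
  int y  = blockIdx.y*blockDim.y + threadIdx.y;
  if(4*x4 >= nx || y >= ny) return;
  const uchar4 a4 = reinterpret_cast<const uchar4 *>(a)[(y*nx)/4 + x4]; // one 32-bit load
  // ... per-byte work on a4.x, a4.y, a4.z, a4.w would go here ...
  reinterpret_cast<uchar4 *>(b)[(y*nx)/4 + x4] = a4;                    // one 32-bit store
}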
• Line 13: Here we define the lambda function idx used for 2D addressing of the arrays a and b.
• Lines 14–16: Here we define some base variables used to address the active tile in both global and
shared memory:
○ Line 14: x0 and y0 are the 2D coordinates of the top left-hand corner of the tile being processed
in global memory; note these are multiples of 64 and 16 respectively because 16 consecutive
threads processing one horizontal line of the tile each process 4 elements.
○ Line 15: xa and ya are the starting position of the current thread in a and b.
○ Line 16: x and y are the starting position of the current thread in the shared memory array as.
• Line 17: Here we copy four bytes from a to the local uchar4 variable a4 in a single memory
transaction; a reinterpret cast is needed to do this. Also because the input argument a is declared
const, we must also declare a4 as const.
39 as[threadIdx.x+1][65] = a[idx(y0+threadIdx.x,xrgt)];
40 }
41 __syncthreads();
42 uchar bout[4];
43 for(int k=0;k<4;k++){
44 float v = fc[0]*as[y-1][x-1] + fc[1]*as[y-1][x] +
fc[2]*as[y-1][x+1] + fc[3]*as[y ][x-1] +
fc[4]*as[y ][x] + fc[5]*as[y ][x+1] +
fc[6]*as[y+1][x-1] + fc[7]*as[y+1][x] +
fc[8]*as[y+1][x+1];
45 uint kf = (uint)(v+0.5f);
46 bout[k] = (uchar)min(255,max(0,kf)); // b in [0,255]
47 x++;
48 }
49 reinterpret_cast<uchar4 *>(b)[idx(ya,xa)/4] =
reinterpret_cast<uchar4 *>(bout)[0];
50 }
• Line 18: Individual bytes from a4 are copied to consecutive locations of shared memory using four
separate 1-byte copy statements. Note at this point the 256 threads in the thread block will together
have filled the entire central 64 × 16 byte tile. In comparison, lines 19–40 are needed to load
the halo.
(We have experimented using a single uchar4 copy here, but this requires extra padding of the
shared memory array to ensure its central tile is aligned on a 4-byte memory boundary and with the
RTX 2070 GPU the resulting code is slightly less fast).
• Lines 19–29: The test on y selects threads 0–15 from warp 0. These 16 threads load the top (y=0)
line of the halo (lines 20–22) and the left-hand edge of the halo (lines 27–28). Note that in line 28
threadIdx.x is used as a y index for both as and a. This is a poor, but here unavoidable,
memory access pattern. A single thread, with threadIdx.x = 0, then also loads the corners of
the top line of the halo (lines 23–26).
• Lines 30–40: These are similar to lines 19–29 but use threads 0–15 of warp 1 to load the bottom and
right-hand edge of the halo. Because we use different warps in these two sections they can run
simultaneously on the GPU. There is a tacit assumption here that every thread block has at least 2 warps.
• Line 41: A __syncthreads() is necessary here to wait until the filling of shared memory
is complete.
• Lines 42–48: Here we see the actual work of the kernel; each thread will handle 4 data elements
from as.
○ Line 42: The uchar array bout is used to store results for the for loop in line 43.
○ Line 43: Here we use a 4-step for loop which processes one element on each pass. The elements of
data are addressed by x and y in this section of code and thus we increment x at the end of each pass
but leave y unchanged.
○ Line 44: Here finally is the actual filter calculation. The result is calculated using floating
point arithmetic.
○ Lines 45–46: The result is rounded to the nearest integer, clamped to be in the range [0,255] and
stored in one element of bout.
• Line 49: The four results in bout are copied to the output array b in a single memory transaction
using casts.
This example illustrates that although the concept of tiles with halos for stencils and filters is simple
and a good fit for GPU architecture, the implementation details can be messy. Note by using external
halos we have been able to ensure that the central 64 × 16 byte active tile is always correctly aligned
on a 32-bit address boundary – this is essential for the use of uchar4 vector loads. Preserving correct
alignment would not have been possible had we used an internal halo. The technique of using
individual warps (or sub-warps in this case) to perform different tasks within a kernel is also
noteworthy. This helps when there is a lot of messy detail to deal with in a kernel. Different warps
can run different tasks simultaneously without thread divergence issues. The example could be
changed to use cooperative groups to define 16-thread tiled partitions to clarify our intentions in lines
19 and 30.
Table 4.5 shows the performance achieved by the three versions of the filter9PT kernel.
The numbers given are the compute performance in GFlops and memory throughput in
GB/sec. The timings are for our RTX 2070 GPU and we have assumed 17 floating point
operations and 10 bytes of global memory accesses per kernel call. For the smallest image
size there is little difference between the kernels, but as image size increases while all kernels
improve their performance, our final kernel (c) using shared memory does best, outperforming
the original kernel by up to a factor of 2.4. The memory bandwidths also increase with
image size and for the largest image sizes exceed the uncached hardware limit by a factor of
up to 3.

$$|\nabla| = \left[\left(\frac{\partial}{\partial x}\right)^{2} + \left(\frac{\partial}{\partial y}\right)^{2}\right]^{1/2}. \tag{4.10}$$
We base our Sobel kernel sobel6PT on the kernel filter9PT_3 which performed
fastest and can easily be adapted by changing its line 44 from a 9-point filter calculation to a
function call. The fragment of the kernel sobel6PT which implements the Sobel filter is
shown in Example 4.11. The device function sobel receives the values of the pixel being
processed and its eight nearest neighbours. The function call is shown as a modified line
44 from Example 4.10 and the sobel function is shown in lines 60–66.
Notice that the function arguments are passed by reference; this means that this function
call has very low overheads; it will use its arguments directly from the shared memory
registers passed in line 44. Lines 62 and 63 show the calculation of the two derivative filters.
We will end this chapter with one more filter, the median filter which requires us to think
about parallel sorting methods.
The median filter simply replaces a target pixel by the median value of itself and its neighbours. Median filters are especially useful for removing single pixel noise from images captured by CCD cameras often used for scientific data. This is illustrated in Figure 4.5 where we have added fake noise to our standard image by setting a random 1% of the pixels to zero. Figure 4.5 (a) shows the original noisy image and Figures 4.5 (b) and (c) show the results of applying a 3 × 3 smoothing filter and a 3 × 3 median filter. The smoothing filter has turned single black pixels into 3 × 3 grey squares while the median filter has eliminated them completely, albeit with some slight blurring of the image.
Implementing a parallel 3 × 3 median filter on a GPU is moderately challenging. For image processing we would like to keep to an implementation where one thread handles one pixel. Thus each thread has to sort 9 numbers and pick the middle one. Additionally, we want to avoid conditional statements which might lead to thread divergence. Our key tool is the non-divergent swap function a_less shown in Example 4.12.
On a CUDA GPU this function is faster than the equivalent using an if statement to compare the two arguments. Notice that the function arguments are passed by reference so that they can be changed in the calling routine. The inline directive means that if the arguments a and b are stored in registers then these registers will be used directly and zero overheads are associated with the function call.
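Example 4.12 itself is not reproduced in this extract. A minimal sketch of such a non-divergent compare-and-swap, which may differ in detail from the book's a_less, is:

// After the call a holds the smaller and b the larger of the two values;
// min and max compile to predicated instructions, so there is no branch.
template <typename T> __device__ __inline__ void a_less(T &a, T &b)
{
    T lo = min(a, b);
    T hi = max(a, b);
    a = lo;
    b = hi;
}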
If we imagine the nine elements are stored in nine locations numbered 1 to 9, a naïve sorting algorithm would be as follows:
Step 1: Call a_less 8 times with elements 1–8 as the first argument and element 9 as the second argument. After this, element 9 will be the largest.
Step 2: Repeat for elements 1–7 as the first argument and element 8 as the fixed second argument. After this, element 8 is the second largest.
Steps 3–8: Continue the process with one less element each time.
At the end of all the steps all nine elements will be sorted into ascending order. For just finding the median we can stop after five steps, after which element 5 is the desired median value. The device function median9 which implements this 5-step algorithm is shown in Example 4.13; in all, 30 calls to a_less are needed.
82 return a5;
83 }
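The full body of Example 4.13 is not shown above. Purely as a sketch, the 5-step scheme could be written out as below using the a_less compare-and-swap (30 calls in total); the name median9_sketch is ours and the book's Example 4.13 may order the calls differently.

template <typename T> __device__ __inline__
T median9_sketch(T a1,T a2,T a3,T a4,T a5,T a6,T a7,T a8,T a9)
{
    // step 1: a9 becomes the largest of the nine values (8 calls)
    a_less(a1,a9); a_less(a2,a9); a_less(a3,a9); a_less(a4,a9);
    a_less(a5,a9); a_less(a6,a9); a_less(a7,a9); a_less(a8,a9);
    // step 2: a8 becomes the second largest (7 calls)
    a_less(a1,a8); a_less(a2,a8); a_less(a3,a8); a_less(a4,a8);
    a_less(a5,a8); a_less(a6,a8); a_less(a7,a8);
    // step 3: a7 becomes the third largest (6 calls)
    a_less(a1,a7); a_less(a2,a7); a_less(a3,a7); a_less(a4,a7);
    a_less(a5,a7); a_less(a6,a7);
    // step 4: a6 becomes the fourth largest (5 calls)
    a_less(a1,a6); a_less(a2,a6); a_less(a3,a6); a_less(a4,a6);
    a_less(a5,a6);
    // step 5: a5 becomes the fifth largest, i.e. the median (4 calls)
    a_less(a1,a5); a_less(a2,a5); a_less(a3,a5); a_less(a4,a5);
    return a5;  // arguments are passed by value, as the text requires
}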
Notice that in median9 the input arguments a1 to a9 may have their values exchanged
by calls to a_less; it is thus necessary for these arguments to be passed by value and not by
reference. The actual values are stored in shared memory by the calling kernel and so their
values need to be preserved for use by multiple threads.
This kernel is quite fast and can process a large image of size 4096 × 4096 in about 0.5 ms, which is more than 1000 times faster than the same algorithm running on a single core of the host CPU. Nevertheless, it is possible to improve the code by removing some unnecessary calls to a_less from the median9 function. It is, in fact, possible to sort 9 numbers with just 23 calls to a_less rather than 30. The scheme for doing this is shown in Figure 4.6 which shows Batcher sorting networks for four and nine numbers. Each vertical
line in the diagram represents a call to the a_less function, after which the grey point at the
top of the line is the larger of the two numbers and the black point at the bottom of the line is
the lower number.
In Figure 4.6 each vertical bar represents a sort between two numbers, after which the smaller value (shown in black) lies below the larger value (shown in grey). After the sort, the median of the nine numbers in the right-hand diagram will be on the central line. These networks have complexity of order N log2(N) and hence become less efficient for large N.
As shown in the figure, the full sort of nine numbers requires 23 pairwise tests. However, 23 tests are more than required to simply find the median. We will follow the method of Perrot et al.,7 which is based on the observation that, given a set of 2N+1 numbers, any sorted subset of N+2 or more numbers has a maximum and a minimum value neither of which can be the median. Thus, to find the median of nine numbers we start with a Batcher network to sort six numbers and find and discard the max and min. We then add one of the unused numbers to the remaining four numbers from the original six and repeat with the resulting set of five numbers, and so on, discarding the max and min each time until we find the median. The resulting median finding network has 20 comparisons as shown in Figure 4.7.
The code to implement this median filter is shown in Example 4.14.
• Line 44: This is the modified version of line 44 from Example 4.10 showing the call to the device
function batcher9 described below. Since this function returns a result of type uchar it can be
directly stored in the output array b.
• Lines 70–74: The function medsort6 implements the 6-item sort from Figure 4.7. It is called with
arguments a4-a8 in line 92. There are 7 calls to a_less here.
• Lines 75–87: These lines define three more functions to perform the remaining sorts of five, four and
three items as shown in Figure 4.7. There are a further 13 calls to a_less here, making a total of
20 calls in all.
• Lines 88–97: Here we define the __device__ function batcher9 which takes the 9 image pixel
values centred on a5 as arguments. These are deliberately passed by value so that they can be safely modified by a_less.
The times taken by the two median filter kernels, median9 and batcher9, to process a large uchar image of size 4096 × 4096 pixels were 0.534 and 0.374 ms respectively. These times are just the kernel run times, ignoring the overheads of loading the image and storing the result. Nevertheless, the filter is impressively fast and certainly fast enough for real-time applications such as denoising video from CCD imaging cameras run at high frame rates and high gain for astronomical or other low-light applications. The equivalent host code running on a single CPU core is about 1000 times slower, requiring 620 and 307 ms per frame for the two versions of the filter.
Batcher networks could of course also be used for simple sorting operations although for
sorting a large number of values other algorithms such as a radix sort are better. In fact
parallel sorting is another “classic” GPU application and like reduce is often discussed in
CUDA tutorials. If you are interested in this topic the CUDA SDK is a good place to start;
the 6_Advanced directory has three examples cdpAdvancedQuicksort,
mergeSort and sortingNetworks which illustrate some of the methods. If your
application needs to sort large arrays it is probably best to start with the available libraries.
For example CUDA thrust has good support for sorting arrays.
. . .
44 bout[k] = batcher9<uchar>(as[y-1][x-1],
as[y-1][x], as[y-1][x+1], as[y ][x-1],
as[y ][x], as[y ][x+1], as[y+1][x-1],
as[y+1][x], as[y+1][x+1]);
. . .
70 template <typename T> __device__ __inline__
void medsort6(T &a1,T &a2,T &a3,T &a4,T &a5,T &a6) {
In this chapter we have introduced the idea of stencils and image filters which both require
essentially the same coding approach. Explicit use of NVIDIA GPU constant memory
was used for the first time. In the next chapter we continue the theme of image manipulation
and introduce another (and the last) type of specialised GPU memory – the texture.
Endnotes Chapter 4
1 Equal spacing is a simplifying assumption; more advanced finite element-based methods (FEM) use
variable point spacings adapted to the local needs, i.e. more points in regions where more detail
is needed.
2 This is not entirely true for recent GPUs. Cooperative groups provide a kernel-wide function for grid synchronisation analogous to __syncthreads(), but this comes at the price of limiting the number of thread blocks in the kernel launch, which for many applications would lead to a performance drop outweighing the gain of needing fewer kernel launches.
3 A kernel using thread blocks of size 16 × 16 and shared memory can either load tiles of size 16 × 16 including the halo or tiles of size 18 × 18 including the halo. The former makes loading shared memory easy but then leaves only 14 × 14 threads to perform the calculation. The latter allows all 16 × 16 threads to perform useful work but this approach significantly complicates the tile loading process. We can think of this as a choice between inner or outer halos.
4 In the next chapter on image processing, we will look at a number of interpolation strategies for
rescaling images.
5 This indeed is the origin of the term pixel. According to Wikipedia (https://en.wikipedia.org/wiki/Pixel)
the first published use was in 1963 in connection with images from space craft. These days pixel is
widely used for both digital images and the screens on which they are displayed. In 3D datasets, for
example, volumetric MRI scans, the term voxel has more recently been introduced for a
volume element.
6 We have used a 512 × 512 portion of a holiday picture of St Ives harbour converted to greyscale.
7 Perrot, G., Domas, S. & Couturier, R. Fine-tuned High-speed Implementation of a GPU-based Median
Filter. Journal of Signal Processing Systems 75, 185–190 (2014).
5
Textures
The previous chapter discussed filtering of 2D images by changing the values of individual pixels based on the values in that pixel's neighbourhood. However, many image transformations, including image resizing and rotation, are more complicated in that there is not a simple one-to-one correspondence between the pixels before and after the transformation.
Digital images are often used to represent mathematical distributions or scientific data. In
these cases, the discrete image pixels are used as an approximation to a “real” continuous
distribution and the pixel value is understood to be an average for the region represented by
that pixel. A digital image will always be displayed in print or on a screen as a 2D
rectangular grid of pixels having a horizontal x-axis and vertical y-axis. We follow the
C++ convention of addressing a row of N pixels with an integer ix index running from 0 to N−1 and a column of M pixels with an integer iy index running from 0 to M−1. The origin is
in the top left-hand corner of the image. This means that for pixels of unit size, the ix and iy
coordinates are the displacements from the top left-hand corner of the image as illustrated in
Figure 5.1.
Figure 5.1 shows the pixels of an image with 10 × 3 pixels where the integer pixel coordinates ix and iy follow the C++ index convention. If the pixels have unit size then the rows are 10 units wide and the columns are 3 units deep. The centres of the pixels are at the positions indicated by dots. In quantitative work a digital image is regarded as a sampling of a continuous 2D function f(x,y) where the value in each pixel is the integral of f over the area of the pixel.1 In turn this means that the pixel values are estimates of the value of f(x,y) at the geometric centre of each pixel, which is obtained by adding 0.5 to the integer pixel coordinates as shown in Figure 5.1.
Thus, if the pixel with x and y coordinates (7,1) has a value of say 42 this means our best
estimate for the underlying continuous distribution is that it has the value 42 at position
(7.5,1.5) and not at (7,1).2 This subtlety can often be ignored in simple image manipulation tasks such as the filters used in the previous chapter. For other more complicated operations, such as scaling the image size by a non-integer value or rotation by an angle that is not a multiple of 90°, we find that the centres of pixels in the transformed image do not exactly correspond to
centres of pixels in the original image. In these cases, we have to calculate the transformed
image by interpolating values from the original image. It is important that we work
consistently with pixel centres when performing these interpolations. The main theme of
this chapter is how to use the texture hardware of NVIDIA GPUs to perform very fast
interpolation. Some of the applications we build around this hardware, such as image
registration, are important in many fields including medical image processing.
pixel values represent values of a continuous function at their central points. Note that the subtraction of 0.5 in CUDA texture image interpolation is done by the CUDA lookup function (tex1D in this case), not by the user, who supplies the program x value as a function argument.
(c) This is the same as (b) but using tex1D(tex,x+0.5) instead of tex1D(tex,x); we refer to this as "Linear Interpolation". This method is useful in applications where exact pixel values are expected at integer values of x. To use this method the user has to specify x+0.5 as an argument to the CUDA texture lookup function. The graph for mode (c) is just the graph for mode (b) shifted half a unit to the left. This is the natural mode to use for image processing because if x is set to a pixel index value i, then tex1D(tex,x+0.5) returns the value in pixel[i]. This method can also be used for table lookup where the texture represents a function sampled at six equally spaced points in the range [0,5].
(d) It is more natural to perform "Table Lookup" with a function sampled at six equally spaced points covering a range of [0,6], and we can do this with the same texture by passing the value 5*x/6+0.5 to the texture lookup function. The graph in (d) is just a stretched version of that in (c).
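To make mode (d) concrete, a device-side wrapper along these lines could perform the scaling; the function name and parameters here are illustrative assumptions, not code from the book.

// A 1D float texture `tex` is assumed to hold N equally spaced samples of a
// function covering the range [0,range]. With N = 6 and range = 6 the scaling
// below reduces to the 5*x/6 + 0.5 used for mode (d) in the text.
__device__ float table_lookup(cudaTextureObject_t tex, float x,
                              float range, int N)
{
    return tex1D<float>(tex, (N - 1)*x/range + 0.5f);
}

With range = 5 the same formula reduces to x+0.5, i.e. the mode (c) lookup described above.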
A nice feature of the texture hardware, illustrated in the figure, is that we do not have to
perform explicit out-of-range checks on the coordinates sent to the texture lookup functions –
the hardware takes care of this issue. The action performed on out-of-range coordinates
depends on the setting of the cudaAddressMode flag used when initialising the texture. For example, to get out-of-range clamping to edge values, we must set this flag to cudaAddressModeClamp.
Another issue to bear in mind is that images displayed on a screen are restricted to 8-bit values by the hardware; this reflects a limitation of the human visual system, which is unable to distinguish more than about 200 shades of grey. The original data may, however, be floating point or integers of 16 or 32 bits. All the processing discussed here and in the previous chapter can and should be done in the native precision of the data, and conversion to 8 bits for display should be left as the last step. In many cases you may not be interested in displaying your transformed image.
Curiously, CUDA textures are rarely discussed in tutorials aimed at scientific applications of CUDA; this may be because textures have their origins in gaming. Another possible reason is that the coefficients used for interpolation in CUDA texture lookups are calculated with only eight bits of precision. In practice, we have never found this to be a problem. At points which represent exact values, exact values are returned; at other points, for reasonable functions, interpolation errors are likely to be no more than one part in a thousand. In cases where the function varies more rapidly it must be remembered that even exact linear interpolation is only an approximation to the underlying function, so additional small errors are unlikely to be important. Another possible issue is that textures are quite verbose to set up. Here we solve that problem for you by using our cx library functions as wrappers that hide much of the complexity. We think it is a pity that textures are not more widely used because, as we shall see, there are real performance gains to be had.
Figure 5.4 Image quality after rotation using nearest pixel and bilinear interpolations
image after a 30 degree rotation using nearest pixel and bilinear interpolation. On the right we show a composite image of a vertical bar which has been rotated about a central point in 10-degree steps. The bar has a width of 2 pixels and its original position is vertical. Again, the pair of images clearly show the results of using nearest pixel and bilinear interpolation. The jagged appearance of the lines is sometimes referred to as aliasing. Notice that rotations by multiples of 90° are free from aliasing; such rotations are equivalent to simply exchanging pairs of pixels.
We can use multiple lerps to build up both 2D bilinear and 3D trilinear interpolations
using straightforward code. Example 5.1 shows functions for performing bilinear inter-
polation and nearest pixel interpolation on either the host or the GPU.
This sort of code always needs some method for dealing with image boundaries and there is no unique best solution. Using texture hardware avoids this complication.
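For reference, lerp here is the standard linear interpolation helper provided by the CUDA samples' helper_math.h; a minimal float-only equivalent is:

// Return the value a fraction t of the way from a to b.
inline __host__ __device__ float lerp(float a, float b, float t)
{
    return a + t*(b - a);
}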
Example 5.1 Bilinear and nearest device and host functions for 2D
image interpolation
// x interp at y1
21 float ly1 = lerp(a[idx(ky1,kx1)],a[idx(ky1,kx2)],ax);
// x interp at y2
22 float ly2 = lerp(a[idx(ky2,kx1)],a[idx(ky2,kx2)],ax);
// y interp of the x interpolated values
23 return (T)lerp(ly1,ly2,ay);
24 }
• Lines 14–15: Here we use the fast CUDA floorf function to find the two pixels with lowest x or y
values for the 4-pixel tile to be used.
• Lines 16–17: The a and b parameters from Figure 5.2 are found here as ax and ay.
• Lines 19–20: The tile coordinates are calculated and clamped to the ranges [0,nx‒1] and [0,ny‒1].
• Lines 21–23: The desired bilinear interpolated value is calculated from 4 pixels of a using 3 lerp calls.
○ Line 21: Set ly1 to value of a interpolated for x and with y fixed to y1.
○ Line 22: Set ly2 to value of a interpolated for x and with y fixed to y2.
○ Line 23: Return value of a for y interpolated between ly1 and ly2.
• Lines 25–32: These show the declaration of the nearest function which interpolates by simply
finding the pixel in which the input point (x,y) lies.
○ Line 28: Clamps to zero all pixels outside valid positions in the original image. This is different to
corresponding line 13 of bilinear which allows a halo of one pixel around the valid region.
This mimics the cudaAddressModeBorder addressing mode of the texture hardware.
○ Lines 29–30: Here we simply round x and y down to the nearest integers, which gives the pixel of the original image containing the point (x,y). We use the clamp function from helper_math.h to ensure that the rounded values ix and iy are both valid. This precaution is not strictly necessary as line 32 has already ensured that ix and iy will be valid. However, we
keep it here for future development; if line 32 was bypassed by being made conditional on a flag
then the function would clamp out-of-range pixels to the nearest edge value rather than clamping
them to zero. This would then mimic the cudaAddressModeClamp addressing mode of the
texture hardware.
Example 5.2 shows code for image rotation consisting of the rotate1 kernel and a simple
main routine. The rotate1 kernel performs image rotation and uses the bilinear function
of Example 5.1. This example is a baseline and does not use hardware textures.
Example 5.2 rotate1 kernel for image rotation and simple main routine
68 thrustHvec<uchar> b(size);
69 thrustDvec<uchar> dev_a(size);
70 thrustDvec<uchar> dev_b(size);
This version of the GPU code for image rotation works reasonably well, delivering over 600 GFlops of compute performance and running 500 times faster than a single-core CPU version. However, as promised, we can do better by using the texture features of NVIDIA GPUs.
Example 5.3 rotate2 kernel demonstrating image rotation using CUDA textures
50 b[idx(y,x)] = (uchar)(255*tex2D<float>(atex,
xr+0.5f, yr+0.5f));
51 }
Description of Example 5.3
• Lines 40–51: The kernel rotate2 is a modification of the rotate1 kernel of Example 5.2; only lines 40 and 50 differ from the corresponding lines in rotate1.
○ Line 40: The second argument is now the cudaTextureObject_t atex which holds the
original image data instead of a simple pointer to the data in GPU memory. Also the kernel is not a
template function; it is explicitly written to process data of type uchar.
○ Line 50: Instead of calling the function bilinear to perform interpolation about the point (xr,yr)
in the source array, we use the built-in CUDA tex2D function to interpolate from the texture
containing a, namely atex. In order to process 8-bit uchar data the values are actually stored
in GPU texture memory as floating point numbers normalised so that the largest representable
native value corresponds to 1.0. This is why we specify <float> in the tex2D template argument
(a default does not work here), scale the result by 255 and finally cast back to a uchar.
Another point to note is that the coordinates (xr,yr) are offset by +0.5f; this corresponds to the linear interpolation mode of texture lookup shown in Figure 5.3 (c).
• Lines 60–79: These lines are a modified version of the main routine in Example 5.2. There are only two
changes. Line 69 is replaced by lines 69.1 and 69.2 and the kernel call in line 75 has been changed.
○ Line 69.1: The int2 object nxy is created and initialised to (nx,ny). It is needed in the next
If you are not familiar with CUDA textures line 69.2 may appear quite complicated; if you are
familiar with CUDA textures you may feel that our modified code is rather less verbose than you
expected. Either way we have gained a factor of up to 3.7 in the interpolation performance of our
code. The performance of our bilinear interpolation codes is shown in Table 5.2.
The first set of columns in Table 5.2 shows the number of images processed per second using a single CPU on the host and the rotate1 and rotate2 GPU kernels. These numbers are
based on the total execution times for the whole calculation and then subtracting IO and
initialisation overheads. For the largest two image sizes a single CPU cannot deliver
processing at video rates but either GPU implementation is more than adequate. The second
set of columns in the table show the compute performance of just the bilinear interpolation
part of the calculation. These numbers were found by also running equivalent jobs with
bilinear interpolation replaced by simply storing fixed values in the output buffer. The
difference in times was taken as the time for interpolation. We assumed each interpolation
involved 20 floating point operations to convert execution times to GFlops per second.
Overall, the GPU implementation is about 500 times faster than the CPU and the texture-based implementation is more than 1000 times faster. Interestingly, the best texture performance is achieved for intermediate image sizes between 256 × 256 and 1024 × 1024. This is in contrast to our usual finding that GPU performance tends to continue to improve as dataset sizes increase. We conjecture that the different behaviour of GPU textures is because the Morton ordering scheme, which makes local x-y neighbour addressing more efficient, is
optimal for these intermediate image sizes (see Appendix H for more detail). Compared to the rotate1 kernel, textures give a factor of 3.7 speed-up for bilinear interpolation of images having size 256 × 256 and at least a factor of 2 improvement for all image sizes.
The rotation of a rectangular image also presents the difficulty that the rotated image may not fit into the original frame. This is illustrated in Figure 5.5 which shows the test image in (a) and a 45° rotation in (b). Note that in (b) the corners of the original image are missing and the corners of (b) contain black regions which were outside the area of the original image and have been set to zero because cudaAddressModeBorder is specified in the texture initialisation in line 69.2 of Example 5.3. This issue can be fixed by padding the original image with a sufficiently wide border as illustrated in Figure 5.5 (c) and (d).
We can use the normalised coordinate feature of CUDA texture lookups to easily implement image rescaling. This allows an image of any size to be interpolated from a texture without needing to know the actual resolution of the stored image. The modifications are shown in Example 5.4.
Example 5.4 rotate3 kernel for simultaneous image rotation and scaling
. . .
69.2 cx::txs2D<uchar> atex(nxy,a.data(), // pass host array
cudaFilterModeLinear, // do linear interpolation
cudaAddressModeBorder, // out of range pixels are zero
cudaReadModeNormalizedFloat, // return floats in [0,1)
cudaCoordNormalized); // coords in [0,1.0] [0,1.0]
. . .
75 rotate3<<<blocks,threads>>>(dev_b.data().get(),
atex.tex, angle, mx, my, scale);
. . .
• Line 40: Declaration of the rotate3 kernel. The image size arguments, previously nx and ny,
have been renamed mx and my to reflect a change of use – they are now only the size of the output
image frame. Because this kernel uses normalised coordinates for texture lookup there is no need for
the kernel to know the actual dimensions of the image stored in the texture. There is also a new final
argument scale which specifies a scaling factor for the output image within the mx x my frame.
A scale value of 1/√2 would allow an image rotated by 45° to fit inside its original frame.
• Lines 41–49: These lines are identical to rotate2 except that mx and my are used instead of nx and ny.
• Lines 50–51: These lines are new and set xs and ys to the rotated coordinates xr and yr but
normalised to the range [0,1] and then rescaled by a factor of scale. Both steps are done by
multiplying by scale/mx and scale/my. There are also terms to keep the image centre in the
centre of the frame and to deal with the 0.5 offsets needed for linear interpolation mode.
• Line 69.2: The main routine is changed to use the flag cudaCoordNormalized as the last argument of the call to cx::txs2D.
The images in Figure 5.5 (c) and (d) were produced using rotate3 with a scale factor of 1/√2 and rotations of 0° and 45°.
Bilinear interpolation will always give a smooth result for any increase in the image size
or for decreasing the size by up to a factor of 2. For larger decreases, say by a factor of 8,
bilinear interpolation between adjacent pixels in the original image is equivalent to sparsely
sampling that image giving a potentially noisy result.
A better approach in this case is to use average values for 8 × 8 tiles spanning the source image. Equivalently our code could be used three times to step the image size down by a factor of 2 each time. The last step could use any factor less than 2 for cases where the total reduction is not an exact power of 2. This is illustrated in Figure 5.6 where our test image with original
resolution of 512 × 512 pixels has been downscaled by a factor of 8 using the rotate3 kernel. The left-hand image used a single step with a scale factor of ⅛ and the right-hand image used 3 separate steps of ½.
• Line 40: The rotate4 kernel declaration only differs from rotate3 in that the first 2 arguments now
refer to uchar4 objects instead of uchar. This is explicit for the output array b and implicit for the
input texture atex4. Each element of the texture is now a 4-component vector of RGBA values for
one pixel and the texture lookup will automatically perform the same interpolation on each
component of the vector.
• Lines 41–51: These lines are unchanged from their equivalents in rotate3.
• Line 52: Here we perform the vector texture lookup which is returned as a float4 result. Since the
definitions in helper_math.h do not include overloaded conversions between float4 and
uchar4 we have to store the result in the temporary variable float4 fb.
• Lines 53–56: Here each component of fb is scaled and then copied to a component of an element of
the output uchar4 b.
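A small helper could perform this per-component conversion; the name float4_to_uchar4 and the assumption that the components lie in [0,1) are ours, not the book's.

// Scale a float4 with components in [0,1) to an 8-bit uchar4 pixel,
// mirroring the per-component copy described for lines 53–56.
__device__ __inline__ uchar4 float4_to_uchar4(float4 f)
{
    return make_uchar4((unsigned char)(255.0f*f.x),
                       (unsigned char)(255.0f*f.y),
                       (unsigned char)(255.0f*f.z),
                       (unsigned char)(255.0f*f.w));
}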
some medical ones such as Dicom. It also allows reading and writing of raw image data as used with the cxbinio routines. The program is well documented and here we will show just one image, in Figure 5.7, of the dialogue for reading binary (or raw) image data. The figure shows the main ImageJ window at the top with the menu options ImageJ → Import → Raw selected together with the raw image properties dialogue which lets you specify the raw image layout.
While separate image viewing tools are useful, incorporating image display directly into your production code is sometimes more interesting as it opens the possibility of interactive user interfaces. The tool we use here is OpenCV. This is a big project and consists of a very large function library, routines from which you can incorporate into your own C++ code. The website https://opencv.org/ is the place to go for downloading the libraries and finding documentation. One of the features we like about this package is that very few lines of code are required to directly read, display and write images of many types including jpg, bmp and png. It also handles video formats nicely; one can treat video frames as simply a set of images. We only use a tiny number of the available features.
The host code to drive rotate4 including the use of OpenCV is shown in Example 5.6;
it is similar to the code used before except we have generalised the code to use OpenCV
instead of cxbinio.
. . .
05 #include "opencv2/core.hpp"
06 #include "opencv2/highgui.hpp"
cudaReadModeNormalizedFloat, cudaCoordNormalized);
100 dim3 threads ={16,16,1};
101 dim3 blocks ={(uint)(mx+15)/16,(uint)(my+15)/16,1};
102 rotate4<<<blocks,threads>>>(dev_b.data().get(),
atex4.tex, angle, mx, my, scale);
103 b = dev_b; // get results
// NB order here is rows,cols
104 Mat out_image(my,mx,CV_8UC3,Scalar(0));
105 uchar4_to_opencv(out_image,b.data());
bmp or png.6 Note the dimensions of the input image are obtained directly from the image
metadata and do not need to be specified here by the user. This is a significant simplification.
○ Arg 2: The name of the output file containing the transformed image; again this can be any of the
usual image formats as specified by the file extension. The output image format does not have to
be the same as the input format, so one use for our program is to simply change image formats.
○ Arg 3: The rotation angle in degrees.
○ Args 5 and 6: The optional dimensions mx and my of the output image; they will default to the
same values as the input image. It is not necessary to preserve the image aspect ratio.
• Line 82: The powerful imread OpenCV statement reads and decodes the image file specified in argv[1] and stores the pixel data in the OpenCV Mat object image. The Mat objects are another example of a container class for arrays. When used for RGB images the data is stored in the order BGR, which can be confusing. The necessary conversion functions are shown above; a sketch of the BGR-to-RGBA direction is also given after this list.
• Lines 83–84: The image dimensions are extracted from image and stored as nx and ny. The OpenCV library has its origins in C not C++ and Mat objects are simple structs whose data variables we can access directly as needed.
• Lines 85–88: Here we get the optional user parameters from command line arguments as usual. The output image dimensions mx and my are now the last two arguments so that the user can specify just the angle and scale parameters without knowing the input image dimensions.
• Line 92: Here we create a familiar thrust host vector, a, of type uchar4 to hold a copy of the input image pixel data.
• Line 93: Copy the pixel data from the Mat object image to the vector a using the opencv_to_uchar4 utility routine. The format is changed from BGR in the Mat image object to RGBA in the thrust vector a.
• Lines 94–96: Create thrust containers for the host output image b and corresponding device vectors
for both a and b.
• Lines 98–99: Create the required texture object atex4 containing the input pixel values, using the
cx::txs2D function as before with the same arguments. Note the template variable is now set to
uchar4 which is also the type of a.data().
• Lines 100–103: Here we run the rotate4 kernel and copy the results back to b as before.
• Line 104: Create the Mat object out_image which is used to display and save the result. The
parameters specify the matrix size (mx x my), type of elements (8-bit unsigned, 3-channels) and
initialise the matrix to zeros. This again is a powerful statement because it allows any array of data to
be displayed and saved as an image.
• Line 105: Copy the pixel data from the thrust vector b to the Mat object out_image using the
uchar4_to_opencv utility function.
• Line 107: In order for OpenCV to display an image, we first have to create a window in which to display that image. We do this by calling the namedWindow function with two parameters; the first parameter is the name given to the window and the second parameter is a defined constant indicating the window type. Here we use WINDOW_NORMAL for the window type and the name of the output file as supplied by the user for the window name. Rather unusually for this sort of package the function does not return a handle to the new window; instead the window is identified by its name, argv[2] in this case.
• Line 108: Display the Mat object out_image in the window created in line 107.
• Line 109: Here the OpenCV function waitKey is used to wait for the user to enter a keystroke. The keycode is returned and here we use it to save the output file unless the escape key has been pressed. The function imwrite is used to save the output image to disk with the user-supplied name and extension contained in argv[2].
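The opencv_to_uchar4 and uchar4_to_opencv utilities used in lines 93 and 105 are defined earlier in the book and are not reproduced in this extract. A hedged sketch of the input direction, assuming an 8-bit 3-channel BGR Mat, might look like this; the loop-based copy is ours and the book's version may differ:

#include <opencv2/core.hpp>
#include <cuda_runtime.h>   // for uchar4 and make_uchar4

// Copy an 8-bit BGR cv::Mat into an array of RGBA uchar4 pixels.
void opencv_to_uchar4(const cv::Mat &image, uchar4 *a)
{
    for (int y = 0; y < image.rows; y++) {
        for (int x = 0; x < image.cols; x++) {
            cv::Vec3b bgr = image.at<cv::Vec3b>(y, x);  // OpenCV stores BGR
            a[y*image.cols + x] =
                make_uchar4(bgr[2], bgr[1], bgr[0], 255); // reorder to RGBA
        }
    }
}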
We think our final reduc4cv program is a useful piece of code: it can resize your digital images in a large variety of standard formats and convert between formats. Future enhancements would be to allow enlarged images to be cropped in windows translated from the image centre and perhaps to allow the mouse to be used for this purpose. For now, we leave these improvements as exercises for the reader.
A typical MRI scan might contain 256 × 256 × 256 volume elements (or voxels, the 3D analogue of 2D pixels). Commonly MRI scans are viewed as a stack of 2D slices, say 256 images. Each planar image is of dimension 256 × 256 and in the x-y plane represents a "slice" along the z axis. But MRI data is more than a stack of images; it is truly volumetric data and for many processing operations it is necessary to operate on complete datasets. One example is registration, where two or more datasets need to be aligned with each other in 3D.
The simplest transformations of a 3D volume are rigid rotations and translations, and even these involve 6 parameters: 3 rotation angles around the x, y and z axes and displacements along these axes. In fact, the most general linear or affine transformation of a 3D volume involves up to 12 parameters, the six already mentioned plus three different scalings along each axis and three shear transformations. We can represent a general affine transformation using a 4 × 4 matrix as shown in Eq. 5.2. The vector (x′, y′, z′) is the result after the affine transform. The 3 × 3 matrix a in the top left-hand corner represents the combined rotation, scale and shear operations and the vector t with components (tx, ty, tz) in the right-hand column represents the translations.
\[
\begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix} =
\begin{bmatrix}
a_{00} & a_{01} & a_{02} & t_x \\
a_{10} & a_{11} & a_{12} & t_y \\
a_{20} & a_{21} & a_{22} & t_z \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}. \tag{5.2}
\]
Examples of individual affine transformation matrices are shown in Eq. 5.3:
\[
\begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad
\begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & s_z \end{bmatrix} \quad \text{and} \quad
\begin{bmatrix} 1 & 0 & 0 \\ \alpha & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{5.3}
\]
and represent a clockwise rotation of θ in the x-y plane, scalings of sx, sy and sz along the coordinate axes and a shear in the x-y plane. The sub-matrix a in Eq. 5.2 is the product of a number of such matrices. Note that the individual 3 × 3 transformation matrices do not in general commute so one must be consistent in the order in which they are multiplied to generate a.
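To make the point about ordering concrete, here is a small host-side sketch (the function name and layout are ours) that composes the z-rotation and scaling matrices of Eq. 5.3 into the 3 × 3 sub-matrix a of Eq. 5.2; swapping the order of r and d in the product gives a different matrix whenever the scaling is anisotropic.

#include <cmath>

// a = r * d, where r is the z-rotation and d the diagonal scaling of Eq. 5.3.
void compose_rot_scale(float theta, float sx, float sy, float sz,
                       float a[3][3])
{
    float c = std::cos(theta), s = std::sin(theta);
    float r[3][3] = {{ c,    s,    0.0f},
                     {-s,    c,    0.0f},
                     { 0.0f, 0.0f, 1.0f}};
    float d[3][3] = {{sx,   0.0f, 0.0f},
                     {0.0f, sy,   0.0f},
                     {0.0f, 0.0f, sz  }};
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            a[i][j] = 0.0f;
            for (int k = 0; k < 3; k++) a[i][j] += r[i][k]*d[k][j];
        }
}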
Our next example is an implementation of 3D volume affine transformations using the six rigid body transformations and a single scaling parameter applied isotropically (i.e. setting sx = sy = sz in Eq. 5.3).
struct affparams {
float4 A0; // three rows of affine
float4 A1; // matrix, translations
float4 A2; // in 4th column
float scale;
};
The struct affparams shown in the box is used to pass the affine parameter matrix from the host to the GPU. Using a struct in this way has several advantages; it makes the kernel
argument lists more compact and, because structs are passed by value, the parameters will
automatically be stored in GPU constant memory. In addition, using structs in this fashion is
helpful during development as function interfaces do not change if the details of the struct
change. For example, if we changed scale to be a float3 to allow for differential scalings,
while the details of our code would change, the kernel arguments would not. The kernel
affine3D used to implement 3D affine transformations is shown in Example 5.7. The
additional parameter scale is used for debugging and is stored as the reciprocal of the user
supplied scale factor.
• Line 10: The kernel arguments are the 3D output array b, the 3D texture atex which holds the input array a, the affine transform parameters passed by value in the struct aff and the dimensions of the input and output arrays passed as the int3 variables n and m. Note the data type is fixed as ushort. This 16-bit type is commonly used in applications such as MRI. Since signed and unsigned data require different scalings when using normalised texture coordinates we have not attempted to make the type of b a template parameter.
• Lines 12–13: The variables ix and iy are the address of the pixel to be processed by the
current thread.
• Lines 16–18: In this kernel we use normalised coordinates in the range [0,1] to address the texture;
here we calculate the normalised distances between the output pixels.
• Lines 19–20: Because the affine transformation includes rotations, the origin of the normalised pixel
coordinates x and y is shifted to the centre of the image.
27 b[mdx(iz,iy,ix)] =
(ushort)(65535.0f*tex3D<float>(atex, xr, yr, zr));
28 }
29 }
• Lines 21–28: This is the loop over all z values for the fixed x-y position handled by the current
thread.
○ Line 22: Get the current normalised z value; this line corresponds to lines 19–20 for x and y.
○ Lines 24–26: Here the affine transformation is performed. In each line the first three terms represent multiplication by the 3 × 3 a matrix in Eq. 5.2 and the fourth term represents the translation by t; note the scaling of tx by dx and similarly for y and z. This means the translation
is measured in pixels on the input image. The final term shifts the origin of the normalised
coordinates back to the (0,0,0) corner of the image.
○ Line 27: Here we perform the texture lookup with implicit trilinear interpolation. The result
is stored directly into the output array b. Note the scale factor used for ushort texture
lookups.
We do not show all the corresponding host code for this example because it is similar to
the previous examples except that the user is able to enter seven parameters for the affine
transformation. The resulting code allows users to apply 3D rotations and translations and
also to enlarge the image and optionally change aspect ratios by adjusting the dimensions of
the output array. Some output examples are shown in Figure 5.8 which shows single slices
from a 256 × 256 × 256 MRI head scan.
The slices shown in Figure 5.8 are as follows:
(a) x-y slice from original image which is oriented in the axial or transverse direction (i.e.
with the perpendicular z-axis running from head to toes).
(b) Sagittal slice obtained by rotating the original image by 90 degrees around the z and y axes.
(c) Coronal (front-back) view obtained by rotating the original image by 90 degrees around the x axis.
(d) Skewed slice obtained by rotating the image by 45 degrees around the y axis. After this rotation the new image width in the x direction exceeded the original width of 256 pixels by up to a factor of √2. The view shown was obtained by doubling the size of the output window with a scaling factor of ½ and then manually cropping the result to 320 × 256.
(e) The coronal image in (c) is somewhat squashed in the vertical direction because the original MRI voxel dimensions were not isotropic; in (e) this is corrected by using 256 × 320 × 256 for the output image. Note that while views (b) and (c) could be generated by reordering pixels in the output volume without interpolation, view (d) requires full 3D interpolation.
Figure 5.8 Affine transformations of a 256 × 256 × 256 MRI head scan
For timing purposes, we also wrote versions of the kernel that used the lerp function instead
of textures to perform 3D interpolation. The function interp3D is shown in Example 5.8; it
uses two sets of bilinear interpolations in x and y similar to the interp2D kernel of Example 5.2
followed by a single lerp in z to perform full trilinear interpolation using a total of 7 lerps.
Some timing results using our RTX 2070 GPU are shown in Table 5.3. The table shows the
number of volumes that could be processed per second assuming the data is already on the
GPU. The affine3D column shows the result for the kernel listed above, the interp3D column
shows the results for a kernel identical to this except that the interpolation is performed by
calling interp3D instead of using a texture lookup. The use of textures gives about a factor
of 5 performance gain, which in this context is huge! The host version is about 2000 times
slower than affine3D. Measurements with Nsight Compute confirm that the affine3D
kernel is delivering well over a TFlop/sec of performance.
45 float x1 = floorf(x-0.5f);
46 float y1 = floorf(y-0.5f);
47 float z1 = floorf(z-0.5f);
48 float ax = x - x1;
49 float ay = y - y1;
50 float az = z - z1;
54 float ly1 =
lerp(a[idx(kz1,ky1,kx1)],a[idx(kz1,ky1,kx2)],ax);
55 float ly2 =
lerp(a[idx(kz1,ky2,kx1)],a[idx(kz1,ky2,kx2)],ax);
56 float lz1 =
lerp(ly1,ly2,ay); // bilinear x-y interp at z1
57 float ly3 =
lerp(a[idx(kz2,ky1,kx1)],a[idx(kz2,ky1,kx2)],ax);
58 float ly4 =
lerp(a[idx(kz2,ky2,kx1)],a[idx(kz2,ky2,kx2)],ax);
59 float lz2 =
lerp(ly3,ly4,ay); // bilinear x-y interp at z2
60 float val =
lerp(lz1,lz2,az); // trilinear interp x-y-z
61 return (T)val;
62 }
25 float yr = aff.A1.x*x+aff.A1.y*y+aff.A1.z*z +
dy*aff.A1.w +0.5f/n.y +0.5f;
26 float zr = aff.A2.x*x+aff.A2.y*y+aff.A2.z*z +
dz*aff.A2.w +0.5f/n.z +0.5f;
27 if(cost == nullptr) b[mdx(iz,iy,ix)] =
(ushort)(65535.0f*tex3D<float>(atex,xr,yr,zr));
27.1 else {
27.2 float bval = (float)b[mdx(iz,iy,ix)];
27.3 float aval = 65535.0f*tex3D<float>
(atex,xr,yr,zr);
27.4 cost[cdx(iy,ix)] += (aval-bval)*(aval-bval);
}
28 }
29 }
○ Line 27.3: Set aval to the trilinear interpolated value of a at the transformed position (xr,yr,zr).
○ Line 27.4: Calculate the square of the difference between the two values and add this to the sum
for this thread in its element of cost. This sum is accumulated over all z values.
Note that the sum computed here will be minimised when the transformation aff takes a into the frame of b, i.e. this code registers volume a to volume b. Volume b is the static or target volume and a is the moving or source volume. Note this is not a restriction on the code, as the volumes used for a and b depend on the order in which they are specified by the user on the command line.
Also note that line 27.4 is the only line that needs changing to implement different cost functions,
for example, fabs(aval-bval) or powf(aval-bval, cpower) where cpower is a user
adjustable parameter.
The remainder of our code in examples 5.10–5.13 is host code built around the
GPU kernel.
Example 5.10 The struct paramset used for affine image registration
30 struct paramset {
31 affparams p;
32 float a[7]; // 3 rotations, 3 translations & 1 scale
33 // constructor
34 paramset(float sc =1.0f,float ax =0.0f,float ay =0.0f,
35 float az =0.0f,float tx =0.0f,float ty =0.0f,
36 float tz =0.0f)
37 {
// units are degrees and pixels
38 a[0] = ax; a[1] = ay; a[2] = az; // rotations
39 a[3] = tx; a[4] = ty; a[5] = tz; // translations
40 a[6] = sc; // global scale
41 }
42 };
• Line 31: A copy of the affine transformation matrix used by the GPU code is stored as affparams p, the first element of the struct.
• Line 32: The 7 physical transformation parameters are stored as elements of the array a[7]. Storing the parameters in this fashion allows the optimisation code to simply loop over them, which reduces the complexity of that code.
• Lines 34–41: In C++ the keywords struct and class are equivalent.7 Thus, here we can include a constructor for the struct with default values which specify an identity transformation.
50 struct cost_functor {
51 ushort *b; // moving image
52 float *d; // buffer of size m.x x m.y
53 thrust::device_vector<float> dsum; // single word
• Line 50: Here we declare our functor cost_functor; note that like other C++ objects we will need to create one or more instances of this struct to actually use the functor.
• Lines 51–56: Here we declare the variables contained in the struct; they are basically all the arguments we need to call the costfun_sumsq kernel. Hiding them in the functor means that our optimisation code will not need to deal with them. The variable calls records the number of calls to the functor, which is helpful for monitoring the efficiency of the optimisation algorithm being used.
• Line 57: This is a default constructor for the functor which does nothing. C++ normally provides an implicit default constructor, but that is not the case here because we also declare a constructor with arguments in lines 58–60.
• Lines 58–60: This is a proper constructor which sets all the parameters to user supplied values.
• Lines 61–71: Here we define the functor itself.
○ Line 61: The functor takes a single argument which is a reference to a paramset object s.
○ Line 62: Here we calculate the affparams values s.p from the values stored in s.a by calling
the host function make_params. This function is relatively expensive in CPU time as it involves
several 3 × 3 matrix multiplications and computations of sin and cos for three angles. The
reason for placing this call inside the functor is to simplify the calling code on the assumption that
the functor is only called when at least one of the physical parameters has been changed. The
function make_params is shown as part of Example 5.13.
○ Lines 63–64: Set the usual thread block configuration for one thread per x-y position in the 3D
volume.
○ Line 65: Call the costfun_sumsq kernel with the first input argument set to an array d of
A simple function to find the minimum of the cost function is shown in Example 5.12.
100 sopt = s;
101 sopt.a[k] = leap;
102 float cnew = cf(sopt);
103 if(cnew < cost1) sb.a[k] = leap;
104 }
105 }
106 // here if parabolic maximum, so go to smallest
107 else sb.a[k] = (cost0 < cost2) ? sl.a[k] : sh.a[k];
108 }
109 float cost3 = cf(sb);
110 if(cost3 < cost1) s = sb; // update only if improved
111 return cost3;
112 }
• Line 80: The function takes 3 arguments, the current parameter values in paramset s, a reference
to the cost function functor cf and a scaling value scale for the step size.
• Line 83: Default step sizes for the parameters are held in the array step. These parameters could be
tuned to improve performance and accuracy.
• Lines 84–87: Several paramsets are defined here; sl for a step down, sh for a step up, sopt for
the current trial set and sb for the best set found so far.
• Line 88: The cost function value on entry is stored in cost1.
• Lines 89–109: Loop over the 7 parameters where each one is optimised in turn.
• Lines 91–94: Set sh and sl to the starting set with only parameter k adjusted up or down by an
amount which depends both on the step[k] and scale. The cost function values for these
parameter sets are stored in cost0 and cost2.
• Line 95: If the cost1 is greater than both cost0 and cost2 then that parabolic fit is a maximum
and we go straight to line 107 and store in sb the parameter corresponding to the smaller of cost0
and cost2.
• Line 96: Here we evaluate the denominator of Eq. 5.4; the result is stored in div.
• Line 97: We check that div is not too close to zero; the cut-off used is tiny given that the cost functions have values of 10^12 or more.
• Line 98: Here we calculate the position of the turning point from Eq. 5.4.
• Line 99: We adjust the proposed parameter step to be not more than twice the current step size. The value
of 2.0 used here turns out to be quite a sensitive tuning parameter. The new parameter is stored in leap.
• Lines 100–104: Here we calculate the cost function value cnew using the new value for the current parameter; if this is smaller than the current value of cost1 we update sb.
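Eq. 5.4 itself is not reproduced in this extract. Purely as a sketch, a generic three-point parabolic step matching the description in the bullets above (denominator div, move capped at twice the step) could be written as follows; the function name and the 1.0e-12 cut-off are our own illustrative choices.

#include <cmath>
#include <algorithm>

// Given costs c0, c1, c2 at parameter values p-h, p and p+h, return the
// position of the vertex of the fitted parabola, with the proposed move
// capped at twice the step size h.
float parabolic_step(float p, float h, float c0, float c1, float c2)
{
    float div = c0 - 2.0f*c1 + c2;           // curvature term (denominator)
    if (std::fabs(div) < 1.0e-12f) return p; // too flat: keep current value
    float delta = 0.5f*h*(c0 - c2)/div;      // signed offset to the vertex
    delta = std::max(-2.0f*h, std::min(2.0f*h, delta)); // cap at two steps
    return p + delta;
}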
Next, in Example 5.13 we show that part of the main routine that calls optimise.
. . .
150 paramset s; // default constructor used
151 cost_functor cf(dev_b.data().get(),
dev_d.data().get(), dev_dsum, atex.tex, n, m);
152 cx::timer tim;
153 float cf2 = cf(s); // cost function before optimisation
○ Lines 154–157: Declare and initialise variables used during optimisation loops.
○ Line 158: Start of outer for loop, the variable scale is reduced on each pass through this loop.
○ Line 159: Start of inner while loop where we call optimise repeatedly with a fixed value of
scale until no further improvement of the cost function is found. A maximum number of calls is
imposed to prevent endless loops.
○ Lines 160–163: Here we call optimise and update cfnew and cfold each time an improvement
is made. Note that the paramset s is also updated by cf when an improvement is made.
○ Lines 165–167: Print progress at end of inner loop and set iter and scale to values required
Some typical results for this code are shown in the next section.
about 310 ms. Thus, we conclude that firstly our code is really very fast by historic standards
for this sort of registration problem, and secondly the overall time is dominated by the host
code not the GPU. In Chapter 7 we show how to overlap both disk IO with host calculation
and host calculation with GPU calculation. These techniques would give further speed-ups,
useful for big-data applications where thousands of MRI images might need frequent
registration. On this happy note we end this chapter. If you are interested in the gory details
of texture creation then Appendix H4 has all the details of the cx routines and the NVIDIA
C++ Programming Guide has even more information.
In the next chapter we move on to discuss random number generation and applications.
Monte Carlo methods have many scientific applications including, very importantly, applications to simulation.
Endnotes Chapter 5
1 In many cases pixel values are only proportional to the sampled distribution function; this is due to the
various rescaling operations that are likely to have occurred between image capture and image display.
2 As a reminder, we follow the usual mathematical convention that the point with coordinates (x, y) on
some graph has a horizontal displacement of x along the x-axis and a vertical displacement of y along
the y-axis. However, in our computer code an image stored in a 2D array A is addressed as A[y][x] and
that is because of the C/Cþþ convention that it is the rightmost index that addresses adjacent memory
locations. As a further source of confusion, in image processing the coordinate origin the usually taken
to be the top left-hand corner of the image so that the y-axis runs downwards whereas in mathematical
and most scientific work the convention is for the y axis to run upwards from an origin at the bottom
left-hand corner.
3 It is quite hard to get this sign choice right. For the transformed image to have a clockwise rotation we
need the pixel (xt, yt) in the original image to have an anti-clockwise rotation (or more generally apply
the inverse transformation). We also need to remember that we are using left-handed axes where y points
downwards from the origin.
4 Of course, in production code one can perform a check on whether a candidate output file already exists
and ask the user for permission to overwrite. We do not do this routinely in our demonstration code to
keep things compact.
5 On early GPUs there was a performance gain from reading data from texture memory rather than global
memory, particularly for 2D arrays. Thus, there are still many examples advocating this floating around the internet. On modern GPUs my own tests suggest that reading from simple pointers to global memory is faster than using textures and less complicated. But this is only true if we remember to use const __restrict__ qualifiers on the kernel input arguments.
6 We rather favour png for scientific image processing because it uses lossless image compression which
makes for smaller files than bmp but does not lose information like most compressed jpg files.
7 There is one small difference: by default the member variables are public for a struct and private for a class. In both cases the default can be overridden.
6
Monte Carlo Applications
NVIDIA GPUs can generate random numbers extremely fast and this enables their use for a
vast range of applications in both Monte Carlo integration and simulation. In this chapter we
explain the various ways in which the cuRAND library can be used.
6.1 Introduction
Scientific applications of Monte Carlo methods have always been important and they continue to grow in importance, tracking the growth and availability of computing power. Today they are a vital tool in many areas of science. To oversimplify, these applications can be classified into two groups: (a) integration – where some function is sampled over random points in its domain and (b) simulation – where the behaviour of some physical system or piece of experimental equipment is investigated using random numbers to mimic a stochastic process. A "good" random number generator (RNG) is an essential tool for many applications of computers in science.
Computers rarely use genuinely random numbers;1 rather they use either pseudorandom
numbers or quasirandom numbers. A pseudorandom number generator (PRNG or just RNG)
starts with an initial bit string (the seed) and calculates a new string where the new bits have
little correlation with the seed bits. This bit string can be converted to either an integer or a
floating-point value and returned to the user as a random number. The generated bit string
then becomes the seed for the next number.
A well-known example of such a generator is the rand function in C++, which is actually a legacy from the early days of C; it is simple to use but is now deprecated by most people. The reasons why rand is disliked are firstly that it returns integers in the range 0 to RAND_MAX where RAND_MAX is an implementation defined constant which is typically either 2^15−1 (Visual Studio 17) or 2^31−1 (gcc 4.8.4). The former value is quite small which makes it hard to scale the results for other ranges. Moreover, the quality of the random numbers is not guaranteed and unwanted correlations between sets of generated numbers may be present.
Finding good algorithms for RNGs which are both fast and which produce sequences with
minimal correlations between values is an ongoing research effort in computer science, but it
is fair to say that modern generators are now excellent. Both modern C++ and CUDA provide libraries for random number generation. For C++11 and later versions, including
the header file <random> gives access to a number of powerful generators. For CUDA the
header files <curand.h> and <curand_kernel.h> give access to generators for either
host code or kernel code.
slow and has poor behaviour for long sequences of numbers. It is, however, a good alternative to
using the time of day clock when generating random seeds for other generators and that is why we
use it here.
• Lines 8–9: Set the int variables points and passes. The code is designed to process a potentially
very large number of points and so uses a doubly nested pair of loops. The inner loop in lines 17–21
processes points generations and accumulates the number of hits in the int variable subtot.
The outer loop between lines 15 and 23 accumulates the values of subtot in the long long
variable pisum which is used later to estimate pi. The number of iterations in the outer loop is
controlled by the user settable variable passes.
• Line 10: Here we initialise the variable seed using a value from either rd or user input.
• Line 11: Here we create an instance, gen, of an RNG having type default_random_engine, and initialise it with seed.
• Line 12: Here we create fdist, a uniform distribution of floating point 32-bit numbers in the
range [0,1). Notice how RNG gen and distribution fdist are used together in lines 18 and 19. The
<random> library provides a variety of generators and distributions most of which can be
combined together in this way.
• Line 13: Initialises a 64-bit integer pisum to count the number of points generated inside the circle.
• Line 14: Creates an instance tim of a timer object from the cx utility file cxtimers.h. The timer
contained within tim starts immediately.
• Lines 15–23: These are the doubly nested loops where random points within a unit square are
created in lines 18 and 19 and then tested for being inside the unit circle in line 20. The number of
hits is accumulated first in the inner-loop variable subtot, which is itself accumulated in the outer-loop
long long variable pisum.
• Line 24: After the loops end, the time taken is stored in the variable gen_time in units of ms.
• Lines 25–26: Here we calculate an estimate of the value of π as four times the ratio of pisum and
ntot and also the fractional error on this estimate in ppm.
02 #include "cx.h"
03 #include "cxtimers.h"
04 #include <random>
11 std::default_random_engine gen(seed);
12 std::uniform_real_distribution<float> fdist(0.0,1.0);
We will use the 75 seconds required for this calculation as a baseline to improve on. The
fractional error of 23.0 is given in parts per million. The number of generated points
falling inside the circle has a binomial probability distribution, which in this case has an
error of $\sqrt{\frac{\pi}{4}\left(1-\frac{\pi}{4}\right)10^{9}}$, so the expected fractional error in our estimate of π is
$\sqrt{(4/\pi - 1)/10^{9}} = 16.5\times10^{-6}$. The fractional error reported here, based on the deviation
from the true value of π, is consistent with this.
The time taken is actually quite large and it turns out that the host code can be improved
by using a different C++ distribution function. During testing we noticed by accident that
the C++ uniform float distribution used in Example 6.1 is seven times slower than the
corresponding integer distribution. Accordingly, we can change the code to use the faster
int distribution by making the changes shown in Example 6.2.
. . .
// uniform ints in [0,2^31-1]
12 std::uniform_int_distribution<int> idist(0,2147483647);
12.1 double idist_scale= 1.0/2147483647.0;
. . .
18 float x= idist_scale*idist(gen); // uniform floats in [0,1.0)
19 float y= idist_scale*idist(gen);
. . .
D:\>pieH2.exe 1000 123456
pi = 3.14158036 err -3.9 ntot 1000000000, time 10275.153 ms
We can get even better performance from the host by using OpenMP to run several
parallel host threads. There is, however, one issue to consider: we cannot simply share the
loops between several threads as we did in Chapter 1. This is because the fdist or idist
functions are likely to be called by multiple threads simultaneously, and these functions may
not be thread-safe.3
Our solution for this problem is to wrap the allocation and use of the RNGs into a single
function and have that function called by multiple threads. In this way, each thread will get
its own instance of the generator and these multiple instances will be able to safely run
simultaneously. One vital detail is to ensure that each thread uses a different seed when
initialising its copy of the RNG. If this were not done, each thread would see the same
random number sequence rather defeating the point of running multiple threads. This is an
obvious point but in practice rather easy to overlook. Interestingly, our kernel code discussed
below also uses one generator per thread and hence has the same issue.
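As a concrete illustration of this pattern, a minimal hedged sketch of such a function is shown below. The names sum_part, points and passes follow the surrounding discussion, but the details of the book's Example 6.3 may differ.

#include <omp.h>
#include <random>

// Hedged sketch: each calling OMP thread builds its own generator,
// seeded by its thread rank so that the sequences differ.
long long sum_part(unsigned int seed, int points, int passes)
{
    int rank = omp_get_thread_num();              // rank of this OMP thread
    std::default_random_engine gen(seed + rank);  // distinct seed per thread
    std::uniform_real_distribution<float> fdist(0.0f, 1.0f);
    long long subtot = 0;
    for (int p = 0; p < passes; p++) {
        for (int i = 0; i < points; i++) {
            float x = fdist(gen);
            float y = fdist(gen);
            if (x*x + y*y < 1.0f) subtot++;       // hit inside the unit circle
        }
    }
    return subtot;
}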
Example 6.3 shows our implementation; the main change is that we have moved lines
13–25 of Example 6.1, which create and initialise the RNG and perform the required
summations, to the separate function sum_part, as shown in lines 11–28 of
Example 6.3. Each OMP thread which calls this function creates a separate instance of
the RNG initialised with a seed that depends on the thread number. The main routine
then sums the contributions from each thread in an OMP reduction operation. Notice that the same
effect could be obtained by running our single-threaded program several times with different
seeds and averaging the results. The separate programs would run in parallel if launched
05 #include "omp.h"
35 cx::timer tim;
36 #pragma omp parallel for reduction (+:pisum)
37 for(int k = 0; k < omp_threads; k++) {
38 pisum += sum_part(seed,pass,passes/omp_threads);
39 }
40 double gen_time = tim.lap_ms();
41 double pi = 4.0*(double)pisum
/((double)(passes)*(double)points);
42 double frac_error = 1000000.0*
(pi-cx::pi<double>)/cx::pi<double>;
43 long long ntot = (long long)(passes)*(long long)points;
44 printf("pi = %10.8f err %.1f, ntot %lld,
time %.3f ms\n", pi, frac_error, ntot, gen_time);
46 return 0;
47 }
together on a multi-core PC. As discussed in Chapter 1, this is a standard approach used by
parallel programmers for “embarrassingly parallel” code. A more detailed description of the
code follows.
this is directly equivalent to the tid or id variables used frequently in our kernel codes.
○ Line 10: Initialises an instance of the C++ default random generator for this thread using a seed
which depends on the rank of the thread, thus ensuring that separate threads use different
sequences of random numbers.
○ Lines 11–23: These are the same as lines 14–24 in Examples 6.1 and 6.2 and calculate the number of hits inside the unit circle.
• Lines 27–30: These are the beginning of the main routine and are the same as the corresponding
lines in Examples 6.1 and 6.2. They initialise the variables points, passes and seed using
optional user input.
• Line 31: This line is new; it initialises the variable omp_threads to the number of OMP threads to
be used.
• Line 33: Tells OMP how many threads to use with a standard library call.
• Lines 34–35: Initialise the accumulator pisum and timer tim.
• Line 36: The pragma tells OMP to split the for loop beginning at line 38 into omp_threads
separate threads and, since the loop counter is also equal to omp_threads, each thread will
execute a single call to sum_part in line 39. The pragma also requests a reduction operation on
pisum; thus the individual return values will be summed and pisum will contain the total
contribution from all threads when the loop is done.
• Lines 37–39: The for loop calls sum_part once for each OMP thread.
• Lines 40–46: Get elapsed time and print results. These lines are the same as in the
previous examples.
• The results shown at the end of the example are for 8 threads, which was the optimum for the
platform used.
For our PC we found eight OMP threads gave the best performance; about 2.46 seconds were
required to generate 10^9 points. The OMP version gives a speed-up by a factor of 30 compared to
our initial base version. This is quite good but we can do better with the help of the GPU.
○ Lines 10–11: Here we get a candidate point in the square by simply looking up a pair of values
from the input random number array rnum.
• Lines 24–25: Here we create a pair of thrust vectors rdm and dev_rdm to hold the random numbers
on the host and device. Since we process points in blocks of size points, these arrays have size
2*points.
• Lines 29–31: These lines create and initialise an instance gen of a cuRAND random number
generator having default type XORWOW and using the variable seed as the initial seed. These
lines are equivalent to line 11 of Example 6.1 which uses the C++11 library <random> for the
same purpose.
• Lines 33–37: This is a modified version of the outer loop which calls sum_part with a different set
of random numbers on each pass.
○ Line 34: This is the most important line in the whole example; here we call a cuRAND distribution
function to generate a set of random numbers in GPU memory using the previously created
generator gen sampled by the distribution implicit in the function call. In this case the function
specified, curandGenerateUniform, produces a uniform distribution of 32-bit floats in (0,1].
The arguments passed to the distribution function are the generator gen, a pointer to the device
memory buffer and the number of points required. Notice the library function handles all the
details required to actually perform the calculation, for example, kernel launches.
○ Line 35: Here we copy the random numbers back to the host.
○ Line 36: Call sum_part with a new set of random numbers each time.
Example 6.4 piH4 with cuRand Host API
. . .
05 #include "curand.h"
46 curandDestroyGenerator(gen); // tidy up
47 return 0;
48 }
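For reference, a condensed hedged sketch of the host API sequence used here is shown below; the wrapper function and the names dev_rdm, points and seed are illustrative, matching the surrounding discussion rather than the book's full listing.

#include <curand.h>

// Hedged sketch: fill a device buffer dev_rdm with 2*points uniform floats.
void fill_uniform(float *dev_rdm, int points, unsigned long long seed)
{
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT); // default XORWOW generator
    curandSetPseudoRandomGeneratorSeed(gen, seed);          // set the base seed
    curandGenerateUniform(gen, dev_rdm, 2*points);          // uniform floats in (0,1]
    curandDestroyGenerator(gen);                            // tidy up
}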
The performance of Example 6.4 is similar to the previous OMP version. However, one
difference is that the GPU version requires a single host thread whereas the OMP version uses
all eight threads on the 4-core hyperthreading PC used. Therefore, this is already a modest
improvement. The host API is limited by the need to copy data from the GPU to the host. In this
example we use a total of 10^9 points, which means transferring 8 GB across the PCI bus; at
6 GB/sec this takes about 1.3 seconds. We can easily reduce this overhead by using “pinned”
memory for the host random number buffer. Modern operating systems use sophisticated real-
time memory management techniques which mean that the memory addresses used by a
typical executing program are virtual, in the sense that they may be remapped to different
physical addresses from time to time during program execution. Blocks of pinned memory are
guaranteed not to be remapped, which allows faster PCI transfer using DMA at 11 GB/sec.
We can take advantage of pinned memory very simply by changing the allocation in line 24 from
using the cx defined type thrustHvec to thrustHvecPin as shown in Example 6.5.
Example 6.5 piH5 with cuRand Host API and pinned memory
The time saving observed in Example 6.5 is about 0.8 secs, consistent with the improved
PCI bandwidth. The cx wrappers used to allocate thrust vectors hide a little bit of complex-
ity; without them the code would look as shown in the box.
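One plausible expansion of the thrustHvecPin wrapper, assuming the Thrust pinned_allocator shipped with contemporary toolkits and that int points has been set earlier, is sketched below; the exact definition used by cx.h may differ.

#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

// Hedged sketch: a pinned host vector of 2*points floats, written without the cx wrapper
thrust::host_vector<float,
    thrust::cuda::experimental::pinned_allocator<float>> rdm(2*points);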
Even with the use of pinned memory the transfer of 2 × 10^9 random numbers takes 720
ms, but we can hide nearly all of this cost by overlapping memory transfers with host
calculation. In essence we need to write code where the host processes random numbers in
block N while the GPU calculates block N+1 and transfers the data to the host in parallel
with the host calculation. Fortunately, CUDA provides a cudaMemcpyAsync function that
does exactly this. To make this code work we obviously will need two buffers, for example,
a and b, for blocks of random numbers that are used in ping-pong fashion, with the host
processing one block while the GPU works simultaneously with the other block. Example
6.6 shows the resulting code. Note that Chapter 7 on CUDA Streams and Events discusses
asynchronous operations in CUDA in much more detail.
. . .
17 int main(int argc,char *argv[])
18 {
19 std::random_device rd;
20 int points = 1000000;
21 int passes = (argc >1) ? atoi(argv[1]) : 1;
// CUDA event
27 cudaEvent_t copydone; cudaEventCreate(&copydone);
45 double pi = 4.0*(double)pisum /
((double)points*(double)passes);
46 long long ntot = passes*points;
47 double frac_error = 1000000.0*
(pi - cx::pi<double>)/cx::pi<double>;
48 printf("pi = %10.8f err %.1f, ntot %lld, time %.3f
ms\n", pi, frac_error, ntot, t1);
49 // tidy up
50 cudaFreeHost(a); cudaFreeHost(b); cudaFree(dev_rdm);
51 curandDestroyGenerator(gen);
52 return 0;
53 }
• Lines 23–26: We cannot perform asynchronous memory transfers using thrust copy operations, so in
this example we revert to the native CUDA memory allocation method using versions of the
cudaMalloc function. In lines 24 and 25 we use cudaMallocHost to allocate two host buffers,
a and b, in host pinned memory. Notice that, as with most CUDA functions, we declare the
pointer variables first and then initialise them by passing their addresses as arguments to a CUDA
function. Line 26: This is a standard memory allocation in device memory using the cudaMalloc
function. Notice that, unlike C++ new or container allocations, in CUDA the array size is specified
in bytes, not as the number of elements required.
• Line 27: To manage asynchronous memory transfers we need to create a CUDA event copydone.
As explained in detail in the next chapter, work submitted to the GPU can be sent to different
streams which operate asynchronously with each other. Items of work in a single stream are run one
at a time in the order they were submitted. A default stream (the null stream or 0) is used in the case
where no stream is explicitly mentioned. CUDA events can be placed in streams between other
pieces of work and when queried return either cudaErrorNotReady if the previous work is not
yet complete or cudaSuccess if the previous work is complete. The CUDA event copydone, as
its name implies, will be used to wait for the completion of the asynchronous memory transfers,
analogously to the use of cudaDeviceSynchronize() to wait for kernel completion.
• Lines 28–32: These are the same as lines 27–31 of Example 6.4; they initialize a timer and set up
the generator.
• Lines 33–34: Generate the first block of random numbers and then copy them to the host buffer a. In
line 35 we use the standard cudaMemcpy function which is not asynchronous but blocks the host
until the transfer is complete.
• Lines 35–43: This is the main loop where blocks of random numbers are generated and processed.
○ Line 36: Starts the generation of the next block of random numbers on the GPU.
○ Line 37: Starts the asynchronous copy of the new block of random numbers to the buffer b. It is
important to know that the work in both this line and the previous line are sent to the default
CUDA stream so that the cudaMemcpyAsync will wait for the completion of the
curandGenerateUniform in line 36 before starting.
○ Line 38: Places our CUDA event in the default stream here. The stream is specified by the
second argument.
○ Line 39: Immediately query the copydone event. It is not logically necessary to query the event
at this point, but at the time of writing this statement is needed here, at least with the Windows
WDDM driver.
○ Line 40: Here we call sum_part with the random numbers contained in buffer a. This
calculation proceeds in parallel with the downloading of a new set of random numbers into buffer
b. In this example the call to sum_part requires more time to execute (about 1 ms) than the
download operation (about 0.7 ms) so the cost of random number generation on the GPU and
downloading to the host is entirely hidden.
○ Line 41: Here we swap the pointers a and b; this implements ping-pong buffer use while
keeping the rest of the code simple. This kind of pointer trickery must be used with care as it is
easy to get confused. In the present program a and b are not used again until the memory they
point to is freed in line 50. At that point it does not matter if the current a points to the original b
and vice versa. But you need to always think about such details with care when you reassign
pointers.
○ Line 42: Wait until the copydone event reports success. At that point, a will point to a complete
and freshly loaded block of new random numbers, ready for the next pass through the for loop.
• Lines 44–48: Print results; these lines are the same as the corresponding lines in Example 6.4.
• Lines 50–51: Free GPU resources. Since we are not using thrust containers, we have to explicitly
free memory allocations on both the host and device.
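To summarise the pattern in one place, a condensed hedged sketch of the ping-pong loop is given below. It assumes gen, dev_rdm, the pinned buffers a and b, the event copydone and the host function sum_part have already been created, with a holding the first block before the loop starts; the details differ from the book's Example 6.6.

#include <utility>           // std::swap
#include <cuda_runtime.h>
#include <curand.h>

long long sum_part(const float *rnum, int points);  // host hit-counting function (assumed)

// Hedged sketch of the ping-pong loop; buffer a is assumed to already hold
// the first block of random numbers, copied before the loop starts.
long long process_blocks(curandGenerator_t gen, float *dev_rdm, float *a, float *b,
                         cudaEvent_t copydone, int points, int passes)
{
    long long pisum = 0;
    for (int k = 0; k < passes; k++) {
        curandGenerateUniform(gen, dev_rdm, 2*points);       // next block on the GPU
        cudaMemcpyAsync(b, dev_rdm, 2*points*sizeof(float),
                        cudaMemcpyDeviceToHost, 0);          // async copy into b
        cudaEventRecord(copydone, 0);                        // marks the end of that copy
        pisum += sum_part(a, points);                        // host processes the block in a
        std::swap(a, b);                                     // ping-pong the buffers
        cudaEventSynchronize(copydone);                      // a now holds fresh numbers
    }
    return pisum;
}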
The results from Example 6.6 are good in that we have removed nearly all the overhead of
random number generation for the host and our code is now about 70 times faster than the
original. If we want to go faster we need to also move the host calculations to the GPU; for
this we need the cuRAND Device API.
In case (a), where we use the same seed and hence the same generator for all threads, each
thread uses a separate subsequence. The subsequences are determined by the second
argument, which is set to id, the sequential rank of each thread in the thread grid. This
gives random numbers with the best statistical properties, but generating the initial state for
all subsequences is slow, depending linearly on the number of threads. Method (a) is about
60 times slower than method (b). In Example 6.7 initialisation for 2^23 threads is required and
takes 184 ms for method (a) but only 2.9 ms for method (b). These numbers were measured
using an RTX 2070 card with CUDA SDK version 11.4; the performance of method (a) was
significantly worse in earlier SDK releases.
Example 6.7 piG kernel for calculation of π using cuRand Device API
01 #include "cx.h"
02 #include "cxtimers.h"
03 #include "curand_kernel.h"
04 #include <random>
D:\ >piG.exe 40 123456 8192 1024
pi = 3.14159441 err 0.559, ntot 1099511627776, time 4507.684 ms
set by the user and an array states in device global memory where each thread stores the final
state of its generator.
○ Line 7: Set id to the rank of the current thread in thread-grid.
○ Line 8: Call curand_init to initialise the default random number generator. In all versions of
cuRAND to date that is the XORWOW generator and here we initialise it using the faster method
(b) with a different seed for each thread. We use seed+id as the simplest way to provide unique
seed values for each thread. In a demanding application, a better way of getting more random bit
patterns might be desirable, for example, multiplying by large prime numbers or using id as an
index into a precalculated table. Correlation between early random numbers is the most likely
problem, so flushing the first 1000 or so values would be another possibility (and still much faster
than method (a)). Note the generator’s initial state is stored in the fourth argument state[id].
○ Line 9: This is the initialisation for alternative method (a); it is commented out here.
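A hedged sketch of such an initialisation kernel, showing both seeding methods, is given below; the signature is modelled on the discussion above and may differ in detail from the book's listing.

#include <curand_kernel.h>

// Hedged sketch of a generator-initialisation kernel for the default XORWOW state.
__global__ void init_generator(long long seed, curandState *state)
{
    int id = blockDim.x*blockIdx.x + threadIdx.x;
    // method (b): different seed per thread, subsequence 0 - fast to initialise
    curand_init(seed + id, 0, 0, &state[id]);
    // method (a): same seed, subsequence id - better statistics but much slower
    // curand_init(seed, id, 0, &state[id]);
}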
• Lines 11–23: This is the kernel function piG, which is the heart of our program and is where all the
time is spent.
○ Line 11: The first and third arguments are similar to the host version. The first argument tsum is a
device array used to hold the partial sums accumulated by each thread. The second input
argument, states, contains the RNG state for each thread as set by the previous call to
init_generator. The final argument points is the number of points to be generated by
each thread. Note that as the number of hits found is stored in the float array tsum, the value of
points should not exceed about 2^25; otherwise there is a danger of losing hits. If this were to be a
problem then tsum could be promoted to a double without incurring any significant time penalty
because it is only used once at the end of the kernel to store final results. We use a floating point
rather than a long integer type for tsum in order to facilitate a subsequent thrust
reduce operation.
Note also that this is a templated function where the template parameter S is the type of random
number generator being used. Thus, this function does not need changing if a different cuRAND
generator is used. Different generators need significantly different methods of initialisation, so we
cannot simply template the initialisation function curand_init; a different initialisation function
would be needed to try a different generator.
○ Line 13: Set id to the current thread’s rank in the thread-grid.
○ Line 14: This copies the RNG state in states[id] to a local variable state. Copying from
device memory to a local variable held in a register is always worthwhile in cases like this where
state is updated in an inner loop.
○ Lines 15–20: This is the main loop which, on each of its points passes, generates a point in the unit square
and tests whether it is also inside the required circle. These lines correspond exactly to the inner loop
in lines 17–21 of Example 6.1. The outer loop of that example has been replaced here by parallel
execution of multiple threads.
○ Line 21: The final sum for the current thread is stored in tsum[id] in device global memory.
strictly necessary for this example as no further use will be made of the generators. However, it is
an essential step in more complicated programs where multiple kernels use the generators. We
recommend you always include a final state save in your cuRAND GPU code.
• Lines 30–52: This is the host code, which sets up the calculation using user supplied values, runs the
two kernels and computes and prints the final result.
• Lines 32–36: Set the configuration parameters for the calculation using optional user input. The
number of points to be generated is entered by the user as shift which is used as a power of 2. The
variable seed is the base seed used to initialise the RNGs. Note our use of atoll(argv[2]),
which allows the user to enter long long values if desired. The variables threads and blocks
are the usual CUDA 1D thread block and grid sizes.
• Line 37: The variable ntot is the total number of points to be generated and is the power of
2 determined by shift.
• Lines 39–40: The value of ntot is rounded up to be an exact multiple of size, that is, the total
number of threads in the thread-grid and points, the number of points to be generated by
each thread.
• Lines 41–42: These lines create two thrust device arrays of dimension size. The first array tsum
will store the number of hits inside the circle found by each thread and the second array state will
hold the RNG state for each thread. We implicitly rely on tsum being initialised to zero by thrust
in our kernels. Note that the array state is of type curandState which implies that the default
XORWOW generator is to be used.
• Lines 43–47: This timed section is where the kernels are called to perform the entire calculation; it is
equivalent to the timed loops in the previous host code examples.
○ Line 44: This calls the kernel init_generator to initialize the generators. This kernel is a
template function which determines which generator to use from the type of its second input
argument state.
○ Line 45: Call the sum_part kernel which does all the work of the calculation. After the call, the
number of points found inside the circle for each thread is stored in the elements of the device
array tsum.
○ Line 46: The elements of the array tsum are added together using a host call to thrust::
reduce, which runs on the GPU using data stored in device memory, and then returns the final
sum to the host. This call is blocking on the host so there is no need for a final
cudaDeviceSynchronize.
Notice the entire calculation is done on the GPU; only the final sum needs to be copied to the host
as a 4-byte float.
• Lines 48–50: Calculate and print π and the fractional error in parts per million.
The thread block and grid-block sizes of 1024 and 8192 used in this example were manually
optimised for the RTX 2070. We think the final performance is very impressive; moving the entire
calculation to the GPU has given a further speed-up of over 280, and the speed-up compared to our
baseline Example 6.1 is about 18,000. Although we have used a trivial integration for this
example, the fact that we can process 10^12 points in about 4 seconds is interesting for less trivial
Monte Carlo integrations. In Chapter 8 we give a more substantial example involving simulating
the detection efficiency of a Positron Emission Tomography (PET) scanner.
The timing results for Examples 6.1–6.7 are summarised in Table 6.1.
Table 6.1 Times required for random number generators using an RTX 2070 GPU
Among the distributions provided by cuRAND is the log-normal (float and double), with density
$P(x\mid\mu,\sigma)= e^{-(\ln x-\mu)^2/2\sigma^2}\big/\big(\sigma x\sqrt{2\pi}\big)$.
We shall need this formula later in Chapter 8, but here it is worth noting that in the special
case a = 0 the right-hand side of Eq. 6.3 becomes $b\,u^{1/2}$. This is useful for directly generating
points inside a circle because the density of points at a distance r from the centre of the circle
is proportional to r and not constant, as would be the case for points inside a square. Thus to
generate a point uniformly distributed inside a circle of radius R, use polar coordinates and
set the radius to $r = R\,u_1^{1/2}$ and the polar angle to $\phi = 2\pi u_2$, where $u_1$ and $u_2$ are random
numbers taken from a standard uniform distribution. This is a much more GPU-friendly
method than the common alternative of generating Cartesian coordinates inside a square of
side R and rejecting those samples that are not also inside the circle. This is because in the
first case all threads get valid points whereas in the second case some threads get invalid
points and have to wait while other threads in the same warp process their valid points.
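As a hedged device-code sketch (not taken from the book's listings), this method can be written as follows; the function name is illustrative.

#include <curand_kernel.h>

// Hedged sketch: return a point uniformly distributed inside a circle of radius R,
// using the inverse transform r = R*sqrt(u1), phi = 2*pi*u2 described above.
__device__ float2 point_in_circle(curandState *state, float R)
{
    float r   = R*sqrtf(curand_uniform(state));             // density proportional to r
    float phi = 2.0f*3.1415926535f*curand_uniform(state);   // uniform polar angle
    return make_float2(r*cosf(phi), r*sinf(phi));
}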
In many cases, notably the Gaussian distribution, it is not possible to find an analytic form
for the inverse function; in these cases a numerical solution in the form of a lookup table can
be used. On GPUs such lookup tables could be accessed using texture fetches for extra
speed. The lookup table itself is just the set $\{x_0, x_1, x_2, \ldots, x_n\}$ such that $F(x_i) = i/n$,
where the equations are solved numerically. This approach might also be useful in cases
where an analytic expression for the inverse function does exist but is slow to calculate.
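A hedged sketch of sampling from such a table with linear interpolation is shown below; the names are illustrative and not taken from the book.

#include <curand_kernel.h>

// Hedged sketch: sample x from a distribution whose inverse CDF is tabulated
// in xtab[0..n], where F(xtab[i]) = i/n as described in the text.
__device__ float sample_from_table(curandState *state, const float *xtab, int n)
{
    float u = curand_uniform(state)*n;         // position in table units, in (0, n]
    int   i = min((int)u, n-1);                // lower bin index
    float f = u - (float)i;                    // fractional part within the bin
    return xtab[i] + f*(xtab[i+1] - xtab[i]);  // linear interpolation
}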
Eq. 6.4 is the simplest possible Ising model where we only consider interactions between
nearest neighbours. The constant J measures the strength of the spin-spin interactions and in
our simulation we set it to one. Notice that E will be positive if most of the neighbouring
spins are parallel to $S_{x,y,z}$ and negative if they are mostly antiparallel. In the simulation we
test a single arbitrary spin and flip it according to the criteria shown in the box.
The criteria in the box correctly simulate the thermodynamics of a physical system at an
absolute temperature proportional to T. The most interesting feature of the Ising model is that
it has a phase transition at about Tc = 4.5115 in the system of units used here. For
temperatures below Tc large scale domains of parallel spins form; above Tc they do not.
Interested readers can find more detail on solid state physics in standard texts or online.
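As a hedged illustration only, a standard Metropolis-style acceptance test for a single spin s with neighbour sum nsum would look something like the sketch below; the exact criterion and sign conventions in the book's box may differ.

#include <curand_kernel.h>

// Hedged sketch of a standard Metropolis flip test for one spin (J = 1).
// dE is the energy change produced by flipping spin s, given the sum of its six neighbours.
__device__ void try_flip(char &s, int nsum, float T, curandState *state)
{
    float dE = 2.0f*(float)(s*nsum);                                // cost of flipping s
    if (dE <= 0.0f || curand_uniform(state) < expf(-dE/T)) s = -s;  // accept the flip
}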
An Ising model simulation demands that any given spin be updated independently of other
spins. Consider a 2D Ising model on a square grid. If we colour the spin sites alternately
white and black like a chessboard then the nearest neighbours of the white squares are all
black squares and vice versa. Thus, we can implement a parallel simulation by updating all
the white sites in parallel and then in a separate step updating all the black sites. This idea can
easily be extended to 3D models provided that alternate slices in z have opposite colours in
each x-y position. The advantage of this approach is that we can update spins in place
reducing the memory needed for the simulation.
Our code to simulate a 3D Ising model is shown in Example 6.8. We use our standard
approach to 3D grids and use one thread per x-y position and a loop over z values. The array
of spins is implemented using type char which can hold values in the range [‒128,127]. We
actually only need two values, namely 1 and ‒1 so a single bit would do, but unless we need
a very large grid, using char is better because it leads to much simpler code. The kernels are
implemented as template functions but only type char is used.
The code is shown and described in Examples 6.8, 6.9 and 6.10 below. This
example has two features: firstly, it shows an implementation of a 3D Ising model in
straightforward kernel code. That code is quite similar to the image processing stencils
discussed in previous chapters. The performance is fast, but no doubt tricks like vector
loading could squeeze a bit more performance from the kernels. Secondly, it shows how
OpenCV can be used to create a simple interactive program that allows you to visualise the
progress of the spin flipping in the model and see the responses to changes in the model
temperature in real time.
○ Lines 24–26: For each thread, flush the first 100 numbers produced by the generator and then store
the resulting state in the device array state. This process reduces the chances of correlations
between generators and is still much faster than using method (a).
• Lines 40–56: The kernel init_spins, which randomly initialises the spins in the 3D device array
spin to either +1 or −1. The input arguments are the generator states in state, the spin array
spin and the dimensions of the spin array. Note that although the type of the spin array is a
template parameter, the code will only work for signed integer types. Type char is obviously the
best choice to economise on memory for large spin systems. Also note that to implement the chess-
board update scheme the actual dimensions of the spin array are 2nx × ny × nz. In this kernel
and the flip_spin kernel the input parameter nx refers to the number of spins in the x-direction
that are updated on a single call. The variable nx2 is used for array addressing purposes.
○ Line 45: Set id to the current thread’s position in the x-y plane of the spin array.
○ Line 47: Define an idx lambda function to address the array spin; note we use nx2 for the
x dimension.
○ Lines 49–54: A loop over all z values where the current thread initialises elements of spin for its
values of x and y and all values of z. This is also done for x+nx and y to allow for the fact that nx
is only half the x dimension. Here the probabilities of spin up and spin down are equal.
Other initialisation schemes are possible here, for example, setting a block of spins to the same
value to study melting at edges.
01 #include "cx.h"
02 #include "cxtimers.h"
03 #include "cxbinio.h"
04 #include <curand_kernel.h>
05 #include <random>
07 #include "Windows.h"
08 #include "opencv2/core/core.hpp"
09 #include "opencv2/highgui/highgui.hpp"
10 using namespace cv;
45 int id = nx*y+x;
46 int nx2 = 2*nx; // actual x-dimension NX of volume
47 auto idx = [&nx2,&ny,&nz](int z,int y,int x)
{ return (ny*z+y)*nx2+x; };
65 int id = nx*y+xt;
66 int nx2 = 2*nx; // NB nx2 is volume x dimension
67 auto idx = [&nx2,&ny,&nz](int z,int y,int x)
{ return (ny*z+y)*nx2+x; };
68 // lambda functions for cyclic boundary conditions
69 auto cyadd = [](int a,int bound)
{ return a < bound-1 ? a+1 : 0; };
○ Line 67: Declare a lambda index function idx to index the spin-array; note the use of nx2 for the
x dimension.
○ Lines 69–70: Two additional lambda functions for cyclic addition or subtraction by 1 are
defined here.
○ Lines 72–73: Here the y-indices for the nearest neighbours are found using cyclic arithmetic.
chessboard the positions of white and black squares will flip depending on whether z is even
or odd.
○ Lines 75–79: In line 77 we find the x coordinate for the current thread depending on its values of
y, z and colour. The other lines determine its nearest neighbours in the x and z directions.
○ Line 80: Here we simply add the spins of the six nearest neighbours.
○ Lines 81–83: Here we implement the thermodynamic-based spin flipping of the Ising model.
The main routine is shown in Example 6.10; it uses OpenCV to implement a
simple interactive visualisation of the model.
151 cudaDeviceSynchronize();
152 double flip_time = tim.lap_ms();
153 printf("timing setup %.3f ms, init %.3f ms,
 flipping %.3f ms\n", state_time,init_time,flip_time);
154 if(dosave > 0){
155 hspin = dspin;
156 cx::write_raw<char>("ising.raw",hspin.data(),volsize);
157 }
○ seed: The initial seed for the cuRAND random number generator. Note a value of zero specifies
○ threadx and thready: The 2D thread block dimension for the flip_spins kernel launches.
○ view: If this is set to a value greater than zero, a window will be opened showing the state of the
central z slice of the spin array. The window will be updated each time view update cycles are
completed. Note updating the window is relatively time consuming so set view to a small number
(1–10) if you want to inspect the early stages of the evolution or a large number (try 100) if you
want to inspect asymptotic behaviour.
○ wait: This is the pause in ms to wait for user input on the current frame. If set to a large value
(say 5000) you can press any key on the keyboard to move to the next frame. If set to a low value
(say 50) you effectively see a real-time movie.
• Lines 105–108: Adjust the x-dimension nx to be an even number and set nxby2 to be half of nx.
The variable nxby2 is used as the number of black or white squares along the x-axis in some
kernel calls.
black squares.
○ Line 135: If the current step through the for loop is to be displayed, the code in lines 136–150 is
executed. The user settable variable view controls how often this occurs.
○ Line 136: Use thrust to copy a single x-y slice of the 3D array dspin to the 2D host array
hslice.
○ Line 137: Next copy hslice to the OpenCV Mat object view_image.
○ Line 138: Display the copied slice in the named window Spins using the OpenCV
function imshow.
○ Line 139: In OpenCV a window filled by a call to imshow only becomes visible when followed
by a call to waitKey. Here the user settable parameter wait specifies the time in ms to wait for
user input before proceeding. A value of zero indicates an indefinite wait. This function returns on
any keystroke and its return value is the code corresponding to the key pressed.
○ Lines 140–148: This is a mini event-loop where different actions are taken in response to different key presses.
The timing results shown at the end of Example 6.10 correspond to about 4 × 10^10 spins
processed per second and Nsight Compute reports about 180 GFlops/sec of compute
performance. This is a good but not outstanding result for the RTX 2070 GPU. The
performance is being held back by memory latency so improvement could be expected if
vector loading were used. Performance might also be improved if shared memory were used,
although as each spin value is shared by at most seven threads, probably native caching is
just as effective. Another optimisation that would improve the locality of memory access is
the use of separate 3D arrays for the white and black voxels. Nevertheless, the performance
from this very straightforward code is fast enough to allow the study and visualisation of
large 3D systems at excellent frame rates. Some results from the simulation are shown in
Figure 6.2.
The images shown in Figure 6.2 are a central slice from the simulation of a volume with
512 × 512 × 512 spins. In (a) the starting configuration with random spins is shown, (b)
shows the steady state after 500 iterations at Tc+1, and (c) shows the steady state after 100 iterations
at Tc. Images (d) to (h) show the evolution of the state at Tc−1 after 5, 20, 50, 100 and 500
iterations. Here Tc is the critical temperature of 4.5115; +1 spins are shown black and
−1 spins are shown white. The states at temperatures Tc and above persist indefinitely with
fluctuations but no emerging structure. The states below Tc develop persistent clusters of like
spins which grow and merge over time. Thermal “noise” from individual spin flips can be
seen on these clusters. This noise increases as we approach Tc from below.
As a final comment on visualisation, the OpenCV implementation shown here is ineffi-
cient in that several, logically unnecessary, copies of the image data across the PCI bus are
performed. The slice data originally in the GPU array dspin is copied to the host array
hspin and then again to another host array in the OpenCV Mat object view_image. Finally,
the call to imshow triggers another copy of the data back across the PCI bus to a separate
GPU memory buffer maintained by OpenCV. These overheads are not important for our
example but might be for more demanding visualisations.
For visualisation, an alternative to OpenCV is the use of OpenGL, which has been well
integrated with CUDA from the very first release. The key advantage is that CUDA kernels
can operate directly on OpenGL memory buffers. However, the resulting code is quite
verbose and opaque; interested readers can find out more in Chapter 3 of the NVIDIA
C++ Programming Guide. The SDK volumeRender example is a good example to study.
Interestingly, the most recent versions of OpenCV have introduced GpuMat objects which
interact well with thrust.5 We feel this is probably a better way to go for new projects.
Our next chapter will show you how to speed up calculations by overlapping operations.
Then in Chapter 8 we return to random numbers and develop code to simulate a medical
imaging system.
Endnotes Chapter 6
1 This is possible on modern PCs using either special hardware which uses thermal noise or some other
hardware stochastic process to generate truly random numbers. The C++ standard for <random>
actually includes a generator, random_device, that uses such a source if available. The random_device
generator is quite slow and in practice is best used to seed other RNGs. Actually, true randomness is
often a bad idea during program development, because it is impossible to repeat a sequence to check
fixes for issues arising from a particular set of generated numbers.
2 This said, we have also run this code using gcc 4.8.4 under the Ubuntu 14.04.4 shell in Windows 10.
In that case we find the opposite behaviour! The float version runs in about 8.2 seconds and the int
version in about 120 seconds.
3 Thread safety is an important consideration when writing multithreaded host code. Many functions
might fail if multiple threads access the same instance of the function simultaneously. One solution is to
implement a locking and queuing mechanism so that a resource can only be used by one thread at a time.
However, queueing would defeat the aim of speeding up code by sharing a task across multiple threads.
In CUDA code, paradoxically, this is less of an issue as resources potentially accessed simultaneously
by multiple threads, such as shared memory, are explicit and tools such as __syncthreads() are available to
help. Needless to say, CUDA intrinsic functions are thread-safe.
4 A lot of the objects used in cuRand have typedef aliases defined by appending _t to their actual name,
for example curandStateXORWOW_t. There is a convention used by some programmers that _t
indicates a type rather than an object, but it makes no difference to the compiler. The cuRand examples
in the NVIDIA SDK mostly follow that convention, but I prefer brevity and omit the _t.
5 See for example https://docs.opencv.org/master/d8/db9/tutorial_gpu_thrust_interop.html.
7 Concurrency Using CUDA Streams and Events
serially. The host can then send packages of independent work to different GPU streams and
the streams can then run them in parallel. In the host code, a CUDA stream is represented by
a cudaStream_t object. Kernel launches and calls to cudaMemcpyAsync can include a
cudaStream_t object as an optional final argument to place different pieces of work into
different CUDA streams. If this argument is omitted or set to zero, then the so-called default
stream is used. Most of the previous examples in the book have in fact implicitly used this
default stream. CUDA streams are created and managed by the host and the most important
functions are shown in Table 7.1.
For each stream, the CUDA API creates a separate queue of work; the items in a particular
queue are completed in FIFO order and their execution does not overlap. In contrast,
operations in different queues are unordered and their executions can overlap. Specifically,
modern GPUs can support concurrent simultaneous IO from host to device (H2D) on one
stream, IO from device to host (D2H) on another stream and execution of multiple kernels on
additional streams. CUDA events (discussed below) provide tools for managing work on
multiple streams, for example, by allowing one stream to wait for work on a second stream to
be complete.
We have seen in previous chapters that running a kernel with a large number of thread
blocks can be advantageous as this allows the hardware to better hide memory latency. This
is achieved by constantly switching between warps as the data they are waiting for becomes
available. In fact the GPU can also do this by switching between warps belonging to
different concurrently executing kernels, which can further improve latency hiding. For
example, if a compute-bound kernel is run concurrently with a memory-bound kernel then the
execution time for both might be no more than that for the memory-bound kernel run on its own.
NVIDIA GPU hardware can typically support up to 32 concurrently executing kernels but to
save resources this is reduced to 8 by default. This number can be changed at runtime using
the environment variable CUDA_DEVICE_MAX_CONNECTIONS, as will be illustrated in
Example 7.1.
A CUDA stream is created using cudaStreamCreate as shown in the box:
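The sketch below is a hedged stand-in for that box and also shows queuing work on the new stream; mykernel, dptr, hptr, bytes, blocks, threads and n are illustrative names, not taken from the book.

#include <cuda_runtime.h>

__global__ void mykernel(float *d, int n);   // illustrative kernel, assumed defined elsewhere

// Hedged sketch: create a CUDA stream, queue work on it, then clean up.
void run_on_stream(float *dptr, float *hptr, size_t bytes, int n, int blocks, int threads)
{
    cudaStream_t s1;
    cudaStreamCreate(&s1);                                           // create the stream
    cudaMemcpyAsync(dptr, hptr, bytes, cudaMemcpyHostToDevice, s1);  // H2D copy on s1
    mykernel<<<blocks, threads, 0, s1>>>(dptr, n);                   // kernel launch on s1
    cudaMemcpyAsync(hptr, dptr, bytes, cudaMemcpyDeviceToHost, s1);  // D2H copy on s1
    cudaStreamSynchronize(s1);                                       // wait for s1 to drain
    cudaStreamDestroy(s1);                                           // release the stream
}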
Having created a stream object, we can then add work to it by using it as an argument in
kernel launches or cudaMemcpyAsync calls.
SDK documentation set. Examples of NVVP timelines for this example are included in the
following discussion.
Notice that the optimisation in this example is not the same as the overlapping of host
computation with GPU processing which we have discussed previously. That optimisation
could be added to this example if there was useful additional host work to perform.
01 #include "cx.h"
02 #include "cxtimers.h"
03 #include "helper_math.h"
. . .
08 __global__ void mashData(cr_Ptr<float> a, r_Ptr<float> b,
 uint asize, int ktime)
09 {
10 int id = blockDim.x*blockIdx.x + threadIdx.x;
11 int stride = blockDim.x*gridDim.x;
12 for (int k = id; k<asize; k+=stride) {
13 float sum = 0.0f;
14 for (int m = 0; m < ktime; m++) {
15 sum += sqrtf(a[k]*a[k]+
(float)(threadIdx.x%32)+(float)m);
16 }
17 b[k] = sum;
18 }
19 }
29 if (maxcon > 0) {
30 char set_maxconnect[256];
31 sprintf(set_maxconnect,
"CUDA_DEVICE_MAX_CONNECTIONS=%d", maxcon);
32 _putenv(set_maxconnect);
33 }
34 thrustHvecPin<float> host(dsize); // host data buffer
35 thrustDvec<float> dev_in(dsize); // dev input buffer
36 thrustDvec<float> dev_out(dsize); // dev output buffer
43 cx::timer tim;
44 // data transfers & kernel launch in each async stream
45 for (int f=0; f<frames; f++) {
46 if(maxcon > 0) { // here for multiple async streams
47 cudaMemcpyAsync(in_ptr, hptr, sizeof(float)*fsize,
cudaMemcpyHostToDevice, streams[f]);
48 if(ktime > 0)
mashData<<<blocks,threads,0,streams[f]>>>
(in_ptr, out_ptr, fsize, ktime);
49 cudaMemcpyAsync(hptr, out_ptr, sizeof(float)*fsize,
cudaMemcpyDeviceToHost, streams[f]);
50 }
51 else { // here for single synchronous default stream
52 cudaMemcpyAsync(in_ptr, hptr, sizeof(float)*fsize,
cudaMemcpyHostToDevice, 0);
53 if(ktime > 0)mashData<<<blocks,threads,0,0>>>
(in_ptr, out_ptr, fsize, ktime);
54 cudaMemcpyAsync(hptr, out_ptr, sizeof(float)*fsize,
cudaMemcpyDeviceToHost, 0);
55 }
56 hptr += fsize; // point to next frame
57 in_ptr += fsize;
58 out_ptr += fsize;
59 }
60 cudaDeviceSynchronize();
61 double t1 = tim.lap_ms(); printf("time %.3f ms\n",t1);
63 std::atexit( []{ cudaDeviceReset(); } );
64 return 0;
65 }
○ Line 48: Launches the kernel mashData to process that data, specifying the required stream as a
that this transfer overwrites the original contents of the host array data for that frame.
Notice this set of three statements will be run sequentially on a particular stream; the use of
cudaMemcpyAsync permits the overlap of operations in different streams on the GPU, not
within the same stream.
• Lines 52–54: These are the same as lines 47–49 but using the default stream for all frames. These
statements are selected for the case maxcon ≤ 0. We expect no asynchronous behaviour on the GPU
in this case, although the cudaMemcpyAsync calls will still not be blocking on the host.
• Lines 56–58: Here we increment all data pointers to point to the next frame.
• Line 60: A cudaDeviceSynchronize call is made here which causes the host to wait for work
on all streams to be done before proceeding. This is necessary for timing purposes and for the host to
be able to use the results copied from the GPU. Note that if we were using simple synchronous
cudaMemcpy calls these would be blocking and this call would be unnecessary. If you are using
asynchronous GPU operations you have to be more careful with explicit synchronisations.
• Line 61: Here we print the time taken for the processing. The really interesting features of the code
can only be investigated using a profiling tool like NVVP.
• Line 62: This is the placeholder indicating where the results, now in the host array data, could be
further processed or written to disk.
• Line 63: This is the end of the program where we need to “tidy up” to free any resources allocated
during execution.
The timing results at the end of this example show a factor of 2 speed-up between running all work on
the default stream (maxcon=0 case) and eight streams (maxcon=8 case).
The complicated nature of line 63 has an interesting explanation, given in the
next section as a little digression before we move on to the results.
The atexit function takes a single argument which is a pointer to the function contain-
ing the code to be run on exit. In line 63 we supply the anonymous C++11 lambda function
[]{ cudaDeviceReset(); }
as the argument. This provides a nice solution to the incompatibility issue. The atexit
function can in fact be placed anywhere in your code, but in this case placing it just before
the exit is the natural choice.
Figure 7.1 Timelines for three-step pipeline code generated using NVVP
Figure 7.1 (c) shows the timelines for ktime=1. Now the kernel time is almost negligible
but the run time is only very slightly reduced to about 116 ms. The H2D and D2H IO still
overlap except for the first and last frames.
In Figure 7.1 (d) the kernel time is now much longer than the time to transfer a frame, so we
see that, while the H2D transfers are still without gaps, there are increasing delays opening up
between their completion and the dependent kernel launch. We also note that there are now
places in the timeline where two kernels are running simultaneously. Running this calcula-
tion on the default stream with one frame takes about 500 ms whereas running with 32 frames
on separate streams takes about 315 ms, so in this case it is the IO time which is
almost “free”.
cudaEvent_t event1;
cudaEventCreate(&event1);
A CUDA event is created with its initial status set to completed. The event’s subsequent
status can be queried using cudaEventQuery(event1), which returns either
cudaSuccess if the event is completed or cudaErrorNotReady if the event is not
yet complete.1 Calls to cudaEventQuery are non-blocking on the host.
The host program can insert CUDA events into CUDA streams between kernel execution
and/or memory copy operations using the cudaEventRecord function. For example:
cudaEventRecord(event1);
or
cudaEventRecord(event2,stream3);
If the stream argument is omitted or set to zero, then the default stream is used. Like kernel
launches, calls to cudaEventRecord are non-blocking on the host and in general it will
take some time before the stream into which the event has been placed catches up and
processes the record instruction. During this interval the event’s status is not completed. The
host can monitor progress using cudaEventQuery(event). When all pending oper-
ations preceding the event in the relevant stream have been completed, the event will change its
status back to completed and the completion time will be stored in the event. The host code can
also use cudaEventSynchronize(event) to wait until a particular event has been
completed; note that while this call is blocking on the host it does not affect operations on
any active CUDA streams, and thus it is a better choice than cudaDeviceSynchronize().
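Before looking at the full example, a condensed hedged sketch of this event-timing pattern on the default stream is given below; mykernel and its arguments are illustrative names.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel(float *d, int n);   // illustrative kernel, assumed defined elsewhere

// Hedged sketch: time a kernel on the default stream using CUDA events.
void time_kernel(float *dptr, int n, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);                  // placed before the kernel launch
    mykernel<<<blocks, threads>>>(dptr, n);  // the work being timed
    cudaEventRecord(stop);                   // placed after the kernel launch
    cudaEventSynchronize(stop);              // host waits for stop to complete
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in ms
    printf("kernel time %.3f ms\n", ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}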
Example 7.2 event1 program showing use of CUDA events with default stream
. . .
// kt1 & kt2 control kernel execution times
25 int kt1 = (argc >4) ? atoi(argv[4]) : 100;
26 int kt2 = (argc >5) ? atoi(argv[5]) : kt1;
27 thrustHvecPin<float> host(dsize);
28 thrustDvec<float> dev_in(dsize);
29 thrustDvec<float> dev_out(dsize);
30 for(int k=0; k<dsize; k++) host[k] =
(float)(k%77)*sqrt(2.0);
31 dev_in = host; // copy H2D
○ Line 34: Runs mashData kernel with time control parameter kt1.
○ Line 35: Performs a cudaDeviceSynchronize so that the host waits for all pending CUDA
work to complete. In this case, only the kernel from line 34 is pending.
○ Line 36: Stores elapsed time since initialisation or last reset of tim in variable host_t1. This
○ Lines 38–39: Wait for completion of second kernel and store elapsed time in host_t2.
• Lines 40–55: This is an alternative version of the two-kernel timing code using CUDA events to
perform the timing; this is more verbose but does not use cudaDeviceSynchronize.
○ Lines 40–43: Declare and initialise the four CUDA events start1, start2, stop1 and stop2; these
will be used to measure the time intervals required for running the two kernels.
○ Line 44: Places a cudaEventRecord for start1 in the default CUDA work stream.
○ Line 46: Places a cudaEventRecord for stop1 in the default CUDA work stream.
○ Lines 47–49: These are the same as lines 45–47 but using the events start2 and stop2 for timing
the second run of the mashData kernel. Notice that there are no host-blocking instructions
between line 38 and this point.
○ Line 50: This is a place holder; at this point in the code, extra host work can be performed which
will run asynchronously with the kernels launched in lines 45 and 48.
○ Line 51: At this point, all asynchronous work has been added to the default CUDA stream and any
extra host work has been done, so we resynchronise the operations of the host and GPU by issuing a
cudaEventSynchronize for the stop2 event. The host will be blocked until the CUDA
stream reaches the stop2 event. We choose stop2 because it is the last event to have been
inserted into the default stream; thus we can be sure that it, together with all the preceding tasks in
that stream, has been completed. Note the host is only blocked for work in the CUDA default
stream; any work in other active CUDA streams would continue asynchronously. This is different
to cudaDeviceSynchronize which blocks the host until work on all CUDA streams
is complete.
○ Lines 52–55: Here we find the execution times for the two kernels using
cudaEventElapsedTime to store the stop1-start1 and stop2-start2 time differ-
ences in event_t1 and event_t2; the units are ms.
• Line 56: Calculates difference between first and second sets of timing measurements.
• Line 57: Prints the measured times.
A typical result for a kernel run time of about 3.4 ms is shown at the bottom of Example 7.2.
In this case we find that timing using CUDA events is consistently about 0.15 ms faster than
using host-based timers needing cudaDeviceSynchronize(). This difference, although
small, is quite interesting. It shows that cudaDeviceSynchronize() has a significant
overhead and using CUDA events can give more accurate kernel timings. In more complex
cases, where multiple CUDA streams are being used, cudaDeviceSynchonize() will
block the host’s progress until all active streams complete their work whereas CUDA streams
allow the host to wait for a specific point on a single stream.
This last point is illustrated in Example 7.3 which is a modification of the previous
example to run the two kernels on two different CUDA streams rather than running both
on the default stream.
. . .
25 int kt1 = (argc >4) ? atoi(argv[4]) : 100;
26 int kt2 = (argc >5) ? atoi(argv[5]) : kt1;
27 int sync = (argc >6) ? atoi(argv[6]) : 0;
34 thrustDvec<float> dev_out2(dsize);
37 // initialise buffers
38 for(uint k=0; k<dsize; k++) host2[k] = host1[k] =
(float)(k%77)*sqrt(2.0);
39 dev_in1 = host1; dev_in2 = host2; // H2D transfers
// optional pause in s2
63 if(sync != 0) cudaStreamWaitEvent(s2,stop1,0);
64 cudaEventRecord(start2,s2);
65 mashData<<<blocks,threads,0,s2>>>(dev_in2.data().get(),
dev_out2.data().get(), dsize, kt2);
66 cudaEventRecord(stop2,s2);
67 // all work now added to CUDA streams
In Example 7.3 we show three different strategies for measuring the kernel times: the first
two are on the host, and the third uses events on the GPU. The first host method uses a
cudaStreamSynchronize immediately after each kernel launch so that the two kernels
actually run sequentially, albeit on different streams. The second host method uses a single
cudaDeviceSynchronize after both kernels have been launched. If there is any overlap in
the execution of the two kernels we would expect the time measured by the second method to
be less than the time measured by the first. Finally, the third method uses CUDA events instead
of host timers to measure the times taken by the two asynchronous kernels. To show the effect
of cudaStreamWaitEvent, there is an optional call to this function in line 63 between
two calls to cudaEventRecord, and the two sets of results shown illustrate its effect.
○ Lines 44–45: Wait for all pending work in s1 to complete and then store the time taken in
host_t1.
○ Lines 45–47: Launch mashData on stream s2 using the second set of buffers, wait for
completion using a second cudaStreamSynchronize and then store the time taken in
host_t2.
• Lines 49–53: Repeat the kernel launches in CUDA streams s1 and s2 but now without calling
cudaStreamSynchronize; hence this time the two kernels may execute simultaneously on the
GPU. If this does occur, we expect the measured time to complete both executions, host_t3,
to be less than the sum of the host_t1 and host_t2 times from the sequential runs. Note that we use
cudaDeviceSynchronize in line 52 to ensure that the pending work in all streams is complete
before measuring host_t3, in line 53. This implements our second host timing method.
• Lines 54–66: In this section we implement our third timing method using CUDA events instead of
host-based timers.
○ Lines 55–58: Declare and create start and stop events for the two streams.
pending event, stop1, on stream s1 completes. This call is dependent on the user settable
parameter sync. If this call is made, we would expect nothing to be done on stream s2 until all
work on s1 is finished and thus the execution of the two kernels is sequential, therefore, the same
as the first host-based timing method.
○ Lines 64–66: These are the same as lines 60–62 but now run the kernel in stream s2 bracketed by
timing purposes when multiple streams are active. The call to cudaStreamWaitEvent can
partially solve this problem by forcing the steps in streams s1 and s2 to run sequentially rather
than asynchronously. However, this does not give us any information about timing in the
asynchronous case.
• Lines 68–73: Here we use the timing information recorded in the start and stop events. Note
that the statements in lines 54–66 are not blocking on the host; to ensure that all events have
been properly recorded before attempting to use cudaEventElapsedTime, we use
cudaEventSynchronize in lines 68 and 71.
• Lines 74–77: Here we print the results; two example sets of results are shown at the bottom of
the figure.
The timing results shown at the end of Example 7.3 are for running event2 with and
without calling cudaStreamWaitEvent (line 63 above). In the first case (no wait)
both ht3 and et2 are the net time for overlapped execution of the two kernels. In the second
case et1 and et2 each measure the time for an individual kernel execution with no overlap.
Hence et1+et2 is about the same value as ht1+ht2, and both give the time required for
executing the two kernels without overlap. For the mashData kernel, running two instances
of the kernel simultaneously on the GPU gives roughly an 8% speed-up.
To illustrate this more clearly, Figure 7.2 shows the timelines obtained using NVVP for
the two cases.
The cudaStreamWaitEvent function is only one of many functions available for managing
activity on multiple CUDA streams, possibly spread across multiple GPUs. A few of these
functions are shown in Table 7.1 and the CUDA Runtime API Reference Manual contains
full details on all the functions for stream management in Section 5.4 and for event
management in Section 5.5.
Element                               Comment
std::thread t;                        Create thread object t using the default constructor. The thread is created but inactive.
t = std::thread(fun,arg1,arg2,...);   Activate thread t to run fun(arg1,arg2,...) asynchronously with the parent code.
bool t.joinable();                    Test whether thread t is active.
void t.join();                        Wait for thread t to finish.
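A minimal complete example using all four operations from the table; the function fun and its arguments are arbitrary placeholders.

#include <cstdio>
#include <thread>

void fun(int a, int b) { std::printf("sum = %d\n", a + b); }

int main()
{
    std::thread t;                  // default constructed: created but inactive
    t = std::thread(fun, 2, 3);     // activate: runs fun(2,3) asynchronously
    if (t.joinable()) t.join();     // wait for the thread to finish
    return 0;
}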
The idea is to run a set of steps, where each step processes different pipeline stages for a
number of frames. A typical step might run the computation for frame n and also handle IO
steps for frames n−1 and n+1. The actual scheme we use is shown in Figure 7.3. To process
N frames, it turns out we need N+3 steps in total to cater for an initial read before any
computation and a final write after the computation has finished. Since we are going to read
and write different frames of data using asynchronous threads, we must take care that one
read has completed before the next one is launched. That is why in the table each read Rn is
preceded by a JRn−1 and likewise each write Wn is preceded by JWn−1 in the scheme. (Here
R stands for read, W stands for write and J stands for join, the C++ threads equivalent of
cudaStreamSynchronize.)
The asyncdiskIO code corresponding to Figure 7.3 is shown in Examples 7.4 and 7.5.
38 return 0;
39 }
69 thrustHvecPin<float> inbuf1(fsize);
70 thrustHvecPin<float> inbuf2(fsize);
71 thrustHvecPin<float> outbuf1(fsize);
72 thrustHvecPin<float> outbuf2(fsize);
73 thrustDvec<float> dev_in(fsize);
74 thrustDvec<float> dev_out(fsize);
98 fclose(fin);
99 fclose(fout);
100 std::atexit( []{ cudaDeviceReset(); } );
101 return 0;
102 }
or (especially) re-reading the file of interest. We find that on our test PC 10 GB is sufficient. The flushing
process takes about 90 seconds so it is useful to have a flag to turn this on or off.
○ Line 60: Sets the buffer size for reading flush files.
○ Lines 64–65: For each of the flush files create a filename and read the file. This version assumes
the files A1.bin etc. are in the current working directory, but the sprintf statement on line
64 could include path names as required.
○ Line 66: After reading each file it might be necessary to use the data in some way to prevent a
smart compiler from optimising out this section of code. We choose to print one word from each
file. This also gives you a simple progress indicator to watch while waiting.
• Lines 69–74: Allocate the host and device input and output buffers as thrust vectors. Notice the
buffer sizes are one frame, which may be considerably less than the size of the full dataset. Although
Figure 7.3 appears to suggest we need a separate set of buffers for each frame, it turns out that
two are sufficient: one for the even columns in the figure and another for the odd columns.
• Lines 75–76: Here we open the input and output data files for binary reading and writing. If errors
occur, the file pointers fin and/or fout will be set to nullptr. It is really important to include
error checking in production code at this point.
• Lines 77–80: Here we declare four host thread variables r1, r2, w1 and w2 which are used for
asynchronous read and write operations.
Lines 81–102: This is the most interesting part of the code and implements the scheme of Figure 7.3. The
implementation is a while loop based on the step counter fstep; the body of the loop has two sections,
each representing one column in Figure 7.3. The first section (lines 84–88) corresponds to the
even columns in the figure and the second section (lines 91–95) to the odd columns.
• Line 81: Initialises the variable fstep which serves as a column counter in the following
while loop.
• Line 82: This is the start of our while loop over fstep. On entry fstep has the value zero. Note the
loop terminates when fstep reaches frames+3, allowing for the extra steps at the start and end of the process.
The conditional clauses on lines 86–88 and 93–95 are used to cater for the blank boxes in Figure 7.3.
• Lines 84–85: Here we check on and if necessary wait for completion of the pending read and write
operations corresponding to JRn and JWn for even n. In the Cþþ <threads> library join and
joinable are member functions of thread objects like w1. The join function waits for an active
thread to complete or returns immediately if the thread is not active. The function joinable
returns true if the thread is active and not detached.
• Line 86: Here we launch thread w1 to write the result of the previous step to outbuf1. This
corresponds to Wn for odd n in Figure 7.3. Notice the deceptively simple syntax for launching a host
thread. We assign a thread newly created by the class initialiser constructor to w1. This thread
executes the function call specified in the constructor argument. In this case we pass a call to our
write_block function.
• Line 87: Launch thread r1 to read data for the next step into inbuf1. This corresponds to Rn for
odd n in Figure 7.3.
• Line 88: Here we call swork using inbuf2 for input and outbuf2 for the result. This call is
blocking on the host thread but not on the asynchronous threads w1 and r1. This corresponds to Cn
for even n in Figure 7.3.
• Line 89: Increment fstep to move on to the next column of Figure 7.3.
• Lines 91–96: These are the same as lines 84–89 but for the odd columns; thus r1 and w1 are
replaced by r2 and w2 and vice versa. Likewise, inbuf2 and outbuf2 are replaced by inbuf1
and outbuf1. (A compact stand-alone sketch of this stepping pattern is given after this list.)
• Lines 98–100: Tidy up at the end of the while loop and exit.
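The self-contained sketch below mimics the while loop just described with trivial stand-in functions for the disk IO and GPU work. It is not the book's asyncdiskIO code: read_frame, write_frame and compute are placeholders, and the pairing of join calls to thread objects is a simplification chosen to keep the sketch race free, so the repository version should be consulted for the exact bookkeeping.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

using buf = std::vector<float>;

void read_frame (buf &b, int n)       { std::fill(b.begin(), b.end(), float(n)); } // stand-in for fread
void write_frame(const buf &b, int n) { std::printf("wrote frame %d (%.0f)\n", n, b[0]); }
void compute    (const buf &in, buf &out) { for (size_t i = 0; i < in.size(); i++) out[i] = 2.0f*in[i]; }

int main()
{
    const int frames = 4, fsize = 1024;
    buf in1(fsize), in2(fsize), out1(fsize), out2(fsize);
    std::thread r1, r2, w1, w2;

    int fstep = 0;
    while (fstep < frames+3) {                               // N+3 steps in total
        // even column: join the threads whose buffers we are about to reuse,
        // relaunch the "1" IO threads, compute on the "2" buffers
        if (r2.joinable()) r2.join();
        if (w2.joinable()) w2.join();
        if (fstep >= 2 && fstep <= frames+1) w1 = std::thread(write_frame, std::cref(out1), fstep-2);
        if (fstep <  frames)                 r1 = std::thread(read_frame,  std::ref(in1),  fstep);
        if (fstep >= 1 && fstep <= frames)   compute(in2, out2);   // blocking on the host
        fstep++;
        if (fstep >= frames+3) break;
        // odd column: same with the roles of the "1" and "2" buffers swapped
        if (r1.joinable()) r1.join();
        if (w1.joinable()) w1.join();
        if (fstep >= 2 && fstep <= frames+1) w2 = std::thread(write_frame, std::cref(out2), fstep-2);
        if (fstep <  frames)                 r2 = std::thread(read_frame,  std::ref(in2),  fstep);
        if (fstep >= 1 && fstep <= frames)   compute(in1, out1);
        fstep++;
    }
    if (r1.joinable()) r1.join();  if (r2.joinable()) r2.join();
    if (w1.joinable()) w1.join();  if (w2.joinable()) w2.join();
    return 0;
}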
Note that we have omitted the performance timing code from these listings but it is present in
the versions in our code repository.
Some results obtained from running Example 7.5 on our test PC are shown in Table 7.3.
All the tests were run using a 1 GB dataset and 4 values of the ktime parameter. All the
times are shown in seconds. Three pairs of columns are shown: baseline is the case with no
overlap of disk IO and kernel execution; kernel overlap is the case where only a single thread
was used to overlap IO operations with the kernel execution but with no overlap between read
and write operations. The last pair of columns is for asyncDiskIO, Example 7.5, with
full overlap of kernel execution and of read and write IO operations.
It turned out to be quite tricky to get reliable timing measurements for this example; the
reason is that modern operating systems go to great lengths to hide the latency and slow read/
write speeds of hard drives. For read operations they cache recently read files in memory so
that, after an initial slow first read operation, subsequent reads of the same file are fast. Write
operations are also cached, in that data is not written directly to the hard drive but is queued
for writing using caches in host memory and the drive hardware. In experiments on my
Windows 10 PC we found that the read cache worked for up to 10 GB of recently read files
and for write operations a command line driven program writing a large file might appear to
complete quickly but then the OS would wait for the write operation to finish before
responding to the next command. Thus, to mimic a pipeline processing situation, we used
a script file to run 12 jobs each using different pairs of input and output files with command
line timing steps between each job. This tactic defeats both the read and write caching. The
times of single jobs still showed significant fluctuations due to other background activity on
the system and the results in Table 7.3 are the median job times from each batch of 12 jobs. It
is possible that other operating systems might behave differently. This tactic is better than
simply flushing the read cache as discussed above.
The four rows of Table 7.3 show results for four different values of the ktimes parameter,
namely 10, 6400, 12800 and 19200. The GPU time, including thrust transfers between the
host and GPU, is shown in the first column;4 these times were measured using a separate
program. Results for two different disk configurations are shown in the table; Case A uses
the same conventional hard drive for both the read and write operations whereas Case B uses
separate drives, a fast SSD drive for reading files and a conventional hard drive for writing
files. In the past, when disk IO was uncached, attempting to read and write to the same
physical drive simultaneously could have a catastrophic impact on performance, as the
mechanical read/write heads would be constantly in motion. However, we see no evidence
for this problem in our results for Case A – a good demonstration that caching is working.
Most of the differences between cases A and B are due to the SSD drive being five times
faster than the conventional drive. In the four rows of results, we see the following:
• GPU time 0.19 seconds: There is little difference between the three cases which is not
surprising as there is almost no GPU time to overlap with disk IO.
• GPU time 8.45 seconds: Now the GPU time is about the same as the time to read or write
the data to the hard drive. Here we see speed-ups of 21% and 28% for case A and 51% and
52% for case B. For both cases this is equivalent to the entire GPU computation being
hidden and for fully overlapped IO there is an additional saving of about 2 seconds.
• GPU time 16.77 seconds: Here the GPU time is about the same as needed for the sequential
reading and writing of data to the hard drive in Case A. The results show that for the fully
overlapped IO of Example 7.5, the kernel execution dominates the computation time with
disk IO adding about 5.3 seconds for case A and only 0.21 seconds for case B. Without the
overlapping of disk IO these overheads rise to 9.06 seconds for case A and 0.88 seconds for case B.
• GPU time 25.17 seconds: Here the GPU time dominates the calculation time for fully
overlapped disk IO which is now essentially free. For serial disk IO in case A there is still a
time penalty of about 8 seconds.
As a final remark it is worth noting that, while the compute task in our Example 7.5 involved
GPU computation, any CPU-based computation would also benefit from the overlapping of
computation and disk IO as shown here. One approach would be to simply change the
function swork as necessary.
7.7 CUDA Graphs
Graphs are also convenient in more complex situations, such as deep learning applications,
because once created, a graph can be launched multiple times or embedded as a single node
in another (parent) graph. Some possible graph topologies are shown in Figure 7.4.
In this figure the nodes in a graph represent activities and can be a kernel call, a
cudaMemcpy or cudaMemset operation, a CPU function call, an empty node or child
graph. The edges in the graph represent dependencies. Graph (a) shows a simple linear case
where each node depends on the previous node, for example H2D memcpy, kernel1, kernel2
and D2H memcpy. Graph (b) shows a case where nodes B and C do not depend on each
other but do depend on node A. Here nodes B and C can run concurrently and will be
automatically launched to do so when the graph is run. Graph (c) shows a three-step linear
graph where the middle node Y is actually the whole of graph (b) embedded as a child graph.
This figure is based on the NVIDIA Cþþ Programming Guide.
CUDA graphs can actually be created in two ways. In simple cases, such as a linear graph
without complex dependencies, an existing workflow can be “captured” as a graph. The
resulting graph can then be run, possibly multiple times, or be embedded in a more complex
parent graph. If there are more complex dependencies, then the second method is better and
uses CUDA API functions to explicitly state inter-node dependencies while leaving the
details of organising concurrent execution to the system.
Example 7.6 is based on the “Getting Started with CUDA Graphs” blog post by Alan
Gray5 and shows how a simple workflow involving many kernel launches can be captured
and how the kernel launch overhead is then reduced when the workflow is run as a graph as
compared to being run directly.
17 thrustHvec<float> host_data(size);
18 thrustDvec<float> dev_out(size);
19 thrustDvec<float> dev_in(size);
24 cx::MYTimer tim;
25 for(int n=0; n<steps; n++){
26 for(int k=0; k<kerns/2; k++){ // ping pong
27 scale<<<blocks,threads,0,s1>>>
(trDptr(dev_out),trDptr(dev_in),size,lambda);
28 scale<<<blocks,threads,0,s1>>>
(trDptr(dev_in),trDptr(dev_out),size,lambda);
29 }
30 }
31 cudaStreamSynchronize(s1);
32 double t1 = tim.lap_ms();
33 float x1 = dev_in[1];
34 printf("standard time %8.3f ms check %f
(expect %f)\n",t1,x1,host_data[1]*10.0);
50 float x2 = dev_in[1];
51 printf("using graph time %8.3f ms check %f
(expect %f)\n",t2,x2,host_data[1]*10.0);
52 return 0;
53 }
○ Lines 25–26: A pair of nested for loops; the outer loop has steps iterations and the inner loop has
kerns/2 iterations.
○ Lines 27–28: These two kernel calls are the body of the loops. The kernel scale is called twice
using the arguments dev_in and dev_out in ping-pong fashion so that after each pair of calls the
contents of dev_in have been multiplied by lambda². Note we launch the kernels using the
non-default stream s1.
• Line 31: Wait for the work on stream s1 to finish before checking the host timer.
• Line 32: Store the time for the double loop in t1.
• Line 33: Copy the value of dev_in[1] to the host variable x1 implicitly using thrust. This should
have been scaled up from 1.0 to 10.0 by the kernel calls.
• Line 34: Print t1 and x1.
• Line 35: Restore dev_in to its original starting value, prior to repeating the calculation in the
second processing block.
• Line 36: Reset the host timer at start of second timed block.
• Lines 37–43: Capture the CUDA work contained in the inner loop (lines 26–29) of the first
processing block into the cudaGraph_t CUDA graph object graph. (A self-contained sketch of this
capture pattern is given after this list.)
○ Line 37: Begin capturing work from CUDA stream s1 by calling cudaStreamBeginCapture.
○ Lines 38–41: These lines are exactly the same as lines 26–29 for the first processing block, but
now the kernel launches are recorded rather than executed immediately.
○ Lines 42–43: End the capture with cudaStreamEndCapture, which stores the captured work in the
graph object graph.
• Line 44: The cudaGraph_t graph object cannot be launched directly; instead a separate
cudaGraphExec_t object has to be derived (or instantiated) from it using
cudaGraphInstantiate. This is done here to create the launchable object g.
• Line 46: Create a second CUDA stream s2 which will be used to launch g. Note it is not necessary
to use a different stream from the one used for capture; we could use either s1 or s2 in line 47.
• Line 47: This single line replaces the outer loop in the previous block (lines 25–30) and launches the
graph g steps times in stream s2 using cudaGraphLaunch.
• Lines 50–52: Display the results from the second processing block, similar to lines 32–34 from the
first block.
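For reference, a self-contained sketch of the capture, instantiate and launch sequence summarised in Table 7.4 is shown below. The scale kernel, sizes and loop counts here are stand-ins rather than the exact code of Example 7.6, and error checking is omitted.

#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *out, const float *in, int size, float lambda)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < size) out[i] = lambda*in[i];
}

int main()
{
    const int size = 1 << 16, threads = 256, blocks = (size+threads-1)/threads;
    const int kerns = 50, steps = 1000;
    const float lambda = powf(10.0f, 1.0f/(kerns*steps));   // net scale factor of 10
    float *dev_in, *dev_out;
    cudaMalloc(&dev_in,  size*sizeof(float));
    cudaMalloc(&dev_out, size*sizeof(float));
    cudaMemset(dev_in, 0, size*sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1); cudaStreamCreate(&s2);

    // capture the inner ping-pong loop into a graph
    cudaGraph_t graph;
    cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < kerns/2; k++) {
        scale<<<blocks,threads,0,s1>>>(dev_out, dev_in, size, lambda);
        scale<<<blocks,threads,0,s1>>>(dev_in, dev_out, size, lambda);
    }
    cudaStreamEndCapture(s1, &graph);

    // instantiate once, then launch the whole set of kernels many times
    cudaGraphExec_t g;
    cudaGraphInstantiate(&g, graph, nullptr, nullptr, 0);
    for (int n = 0; n < steps; n++) cudaGraphLaunch(g, s2);
    cudaStreamSynchronize(s2);

    printf("launched %d kernels via %d graph launches\n", kerns*steps, steps);
    cudaDeviceReset();
    return 0;
}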
The results shown are for processing arrays of 2^16 4-byte floats using 50,000 kernel calls
grouped into 1000 sets of 50 calls. The time per kernel measured on the host was typically
2.853 µs for the first method, reducing to 1.793 µs for the second method where each set of 50 calls
was launched as a single CUDA graph. This is a reduction of 1.059 µs or 37%.6 Note that at present
the performance gain is only significant for kernels with microsecond execution times. This is
because at present CUDA graphs mainly reduce kernel launch overheads. However, this is a
relatively new feature and may become more powerful in future versions of CUDA.
A summary of the functions needed to create a CUDA graph using capture is shown in
Table 7.4.
CUDA also provides an alternative set of API functions to allow a graph to be defined
directly rather than using capture. This alternative is considerably more verbose than the
capture approach. Nevertheless, it might be useful in situations where there are complex
dependencies between nodes in the graph because those dependencies can be defined
directly not implicitly via events. We will not give an example here but interested readers
should look at the simpleCudaGraphs example in recent versions of the CUDA SDK.
This chapter has focused on overlapping IO operations with kernel and/or CPU activity as a
means of further accelerating parallel code. We have looked at both transfers between the host
and GPU and between the host and external disk drives. The latter case is not really a CUDA
topic but is discussed here because this is often a bottleneck in data processing pipelines.
In the next chapter, we return to random numbers with a substantial example based on a
modern medical imaging positron emission tomography (PET) scanner. These scanners work by detecting
the pairs of back-to-back gamma rays produced when positrons annihilate in the subject.
Table 7.4 API functions needed for creation of CUDA graphs via capture
Endnotes Chapter 7
1 The function cudaEventQuery can also return some other error codes indicating problems with the
input event argument. See the CUDA Runtime API documentation for details.
2 PCs do a lot of data caching when reading from disk; thus, reading a large file for a second time is often
much faster than reading it for the first time. These caches may be many GB, so in order to time disk IO
realistically we have to read up to 10 GB of data from other files between each test run in order to
“flush” the cache and force the PC to actually read data from disk each time we run a test.
3 As is our custom no error checking is shown in the listings for these functions. However, the real-world
code does indeed contain error checking which is particularly important for IO operations as problems
with data files at run time can never be excluded.
4 In a previous example we have shown you how to overlap these transfers with kernel execution, but we
have not included this refinement in Example 7.4 in order not to overcomplicate the code. In fact the
thrust IO takes a total of about 180 ms in this example, which is small compared to the disk transfer
overheads which are about 17 seconds.
5 Getting Started with CUDA Graphs, Alan Gray https://devblogs.nvidia.com/cuda-graphs/,
5 September, 2019.
6 Note these are typical numbers for my RTX 2070 GPU running on Windows 10. There is quite a lot of
variation between individual runs.
8 Application to PET Scanners
In this chapter we show how CUDA can be used with a much more substantial calculation
with features typical of many research applications. Specifically, we exploit the use of fast
random number generation to perform a simulation. The simulation of experimental equipment
is now an important tool in many areas of research.
In this application chapter we show how the GPU can be used to greatly speed up both
calibration and image reconstruction in clinical positron emission tomography (PET) scanners.
Developing this code will illustrate points of general interest in designing GPU code, specifically
the mapping of symmetries to GPU threads and careful design of memory layout. The host
code needed for this application is more complicated than in our previous chapters but the
GPU code (which does nearly all the work) is all contained in a small number of short kernels.
Full details, including all the host code, are available on our code repository.
Section 8.1 sets up the problem in some detail including important details of how the data
is organised to take advantage of the symmetries in the system. This section is quite long and
could be skimmed over in a first reading. Our simulation code in Section 8.2 is quite
straightforward and will help readers to simulate other systems without needing to master
all the details in Section 8.1.
An early description of the method can be found in Linda Kaufman’s paper.1 This paper is
also interesting because it contains an early discussion of the advantages of using polar rather
than Cartesian voxel geometry for event reconstruction.
The MLEM method is iterative and uses the deceptively simple formula shown in eqn 8.1:
$$a_v^{n+1} \;=\; \frac{a_v^n}{\sum_{l'=1}^{N_l} S_{vl'}} \;\sum_{l=1}^{N_l} \frac{m_l\, S_{vl}}{\sum_{v'=1}^{N_v} S_{v'l}\, a_{v'}^n}, \qquad (8.1)$$
where $a_v^n$ is the estimated activity in voxel $v$ at iteration $n$, $m_l$ is the measured number of
decays in LOR $l$, and $S_{vl}$ is the PET system matrix (SM). The SM elements $S_{vl}$ are defined
as the probabilities that a decay in voxel $v$ is detected in LOR $l$. The iteration is usually
started by assigning equal values to all the $a_v^0$. The summation limits $N_l$ and $N_v$ are the total
numbers of LORs and voxels involved.
Applications of Eq. 8.1 are not confined to PET; it is applicable to all tomographic
applications and more generally to any problem where a set of detected measurements from
multiple sources is used to estimate the strength of those sources.2 Another example of one
such application is given at the end of this chapter.
There are three summations involved in Eq. 8.1, each of which has a direct physical interpretation.
The denominator term $FP_l = \sum_{v'} a_{v'}^n S_{v'l}$ is the net contribution of all the currently
estimated voxel activities $a_{v'}^n$ to the LOR $l$. This is referred to as the forward projection (FP)
of subject activity into the detector. The middle term can then be written as
$BP_v = \sum_l m_l S_{vl}/FP_l$, which is the backward projection (BP) of all the detected decays $m_l$
back into the voxels. As the iteration proceeds we expect $FP_l$ to tend to $m_l$, so that the $BP_v$
factor will tend to a constant which is just the overall probability that a decay in voxel $v$ is
detected somewhere in the detector; therefore, it is the detection efficiency for that voxel.
The final summation term $\sum_{l'} S_{vl'}$ is also just this detection efficiency, so that the overall
factor multiplying $a_v^n$ tends to one as the iterations proceed.
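To make the three summations concrete, here is a hedged CPU sketch of a single MLEM iteration for a small dense system matrix S[l][v]. The GPU implementation described later works with the sparse, symmetry-compressed system matrix instead, so this is for illustration only.

#include <vector>

void mlem_iteration(const std::vector<std::vector<float>> &S,  // S[l][v], dense system matrix
                    const std::vector<float> &m,               // measured counts per LOR
                    std::vector<float> &a)                     // activity per voxel (updated in place)
{
    const size_t Nl = S.size(), Nv = a.size();
    std::vector<float> FP(Nl, 0.0f), BP(Nv, 0.0f), norm(Nv, 0.0f);

    // forward projection: FP_l = sum_v a_v S_vl
    for (size_t l = 0; l < Nl; l++)
        for (size_t v = 0; v < Nv; v++) FP[l] += a[v]*S[l][v];

    // backward projection: BP_v = sum_l m_l S_vl / FP_l
    for (size_t l = 0; l < Nl; l++)
        if (FP[l] > 0.0f)
            for (size_t v = 0; v < Nv; v++) BP[v] += m[l]*S[l][v]/FP[l];

    // normalisation (voxel detection efficiency): norm_v = sum_l S_vl
    for (size_t l = 0; l < Nl; l++)
        for (size_t v = 0; v < Nv; v++) norm[v] += S[l][v];

    // update: a_v <- a_v * BP_v / norm_v
    for (size_t v = 0; v < Nv; v++)
        if (norm[v] > 0.0f) a[v] *= BP[v]/norm[v];
}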
The problem with the MLEM method is that for a useful PET system, the matrix $S_{vl}$ is
enormous – but this is just what makes it an interesting challenge for GPU code. We will
consider a clinical system with 400 detectors per ring and 64 rings and assume each detector
has a face 4 × 4 mm in size. This leads to an inner ring radius of about 256 mm (enough for a
head or thin subject).
per ring). The final array size is Nv = 400 × 100 × 64 = 2,560,000 words corresponding to
~10^7 bytes if we use floats to store the values. This is not a problem.
• The LOR array is more difficult. Each detector is defined by two numbers, “c” (for crystal)
the angular position around the ring in the range 0–399 and “z” the detector ring number
in the range 0–63. A LOR is defined by a pair of detectors (c1,z1) and (c2,z2). The
total number of possibilities is therefore potentially (400 × 64)^2 ≈ 6.55 × 10^8 which
requires over 2.6 × 10^9 bytes to store as 4-byte integers – this is a problem. We can reduce
the size somewhat by noting that LORs have no preferred direction; thus we can adopt
the convention that z1 ≤ z2. Furthermore, if z1=0 there are 64 possibilities for z2, if
z1 = 1, there are 63 possibilities for z2 and so on down to one possibility for z2 if z1=63.
Thus, the total number of (z1,z2) pairs is only 64+63+...+1 = 2080. Having adopted
our convention for z1 the associated c1 must be allowed to have all 400 possible values,
but for each c1 value we can impose a restriction on the corresponding c2, namely that it
differs from c1 by at least 100. This restricts c2 to 201 possibilities and effectively puts
c2 into the opposite half of the detector in the transverse plane.3 For example, if c1=0
then 100≤c2≤300. Finally, we note that all four values can be packed into a single 4-byte
field allowing 9-bits for c1 and c2 and 7 bits for z1 and z2. The resulting storage
requirement is now Nl = 2080 400 201 = 167,232,000 words, corresponding to
668,928,000 bytes, still a big number, but the best we can do. It will turn out that the
performance of our code is limited by access to this array.
• The system matrix s has a nominal size of Nv × Nl which is about 4.28 × 10^14 (we did
mention it was enormous), but fortunately s is a very sparse matrix and has many
symmetries.
○ Polar voxels express the rotational symmetry of the scanner; therefore, the LORs
have identical s values for all z such that z1+z ≥0 and z2+z ≤63. This symmetry is
exploited by loops over z in our kernels. The voxel position is also adjusted by z when
applying this symmetry.
○ Other symmetries exist for example reflection symmetry about the vertical axis in the
the voxel z position and not absolute detector positions. This change is helpful when
applying the z-translation symmetry. The last 4 bytes are the floating-point value of the
detection probability calculated by our simulation. It will turn out that for each voxel ring
sector we need to store about 1.3 × 10^5 values resulting in a total SM size of about 1.2 × 10^8
bytes, an order of magnitude less than what we need for the LORs.
On a conventional system each MLEM iteration requires many minutes of CPU time and,
as usually a large number of iterations are required, a full MLEM PET reconstruction takes
hours. To speed up the process people often use a subset of the LORs for each MLEM
iteration, and on each iteration a different subset is used. This so-called ordered subsets
expectation maximisation method (OSEM) was introduced by Hudson and Larkin in 1994.5
We will show examples of reconstruction using both the full MLEM method and OSEM.
Figure 8.3 Illustrates the coordinates and storage methods for both detected LORs and
system matrix elements.
Figure 8.3 Shows a sketch of the longitudinal and transverse views of a detected decay in
the PET scanner. For a detected event, the LOR contains the c and z of the two detectors that
fired in coincidence and the total number of times this occurred (counts). For the SM the
LOR contains the z displacements from the known decay voxel and the c values of the LOR
when the decay voxel has c=0. The packed key format used allows for a maximum of
128 detector rings each having up to 512 crystals. (More precisely a PET scanner could have
more than 128 rings but only use detected LORs with a maximum z difference of 127).
The code described in this chapter has a number of components including:
1. Calculation of the SM of the scanner using a Monte Carlo method. The PET scanner is
defined by the header file scanner.h which contains parameters defining the detector
elements and the polar voxel grid used by the programs. All the programs in this chapter
use this file and the code should work for other scanners and voxel grids if this file is
changed (but note the limits on ring and detector numbers implicit in the 32-bit key
format). The constants are defined using C++11 constexpr and not with C/C++
macros. This has the advantage that all derived values are evaluated at compile time and
that the parameters are treated as const.
The system matrix is calculated by simulating the detector response to decays in repre-
sentative voxels using the fullsim program. The file fullsim.cu contains all the
necessary host code and kernels. On its own this program provides a nice example of a
physics simulation.
2. The fullsim program is designed to be run separately for each different set of radial
distances of voxels from the central axis of the scanner. All the results from fullsim are
then assembled into a system matrix. This is done with the short readspot program
which just uses host code contained in readspot.cpp.
3. We implement a full MLEM reconstruction for the scanner. For this we need a test or
phantom data set. The creation of such datasets is implemented as an option in the
fullsim program. This is done rather easily by extending the simulation volumes used
by fullsim from single polar voxels to other volumes such as ellipsoids.
4. We implement MLEM from Eq. 8.1 efficiently in GPU code; this is done with the reco
program in the file reco.cu. The core kernel code in this program is fast and requires
less than 60 lines. There are many real-world applications of MLEM so this example
should be of interest for many tomographic and other applications.
5. For further speed-up and to compare with common practice we also implement an OSEM
version of our PET reconstruction. For this we need to sort the system matrix into subsets
and that is done with the host program smsplit.cpp.
6. Implementing OSEM with the modified system matrix is just a simple modification to
reco.cu and is contained in the separate recosem program contained in the file
recosem.cu.
7. To inspect our reconstructed images, we have to remap the polar voxel reconstructions
back to a Cartesian grid for display and potential quantitative analysis. This is done by
the simple host poluse program contained in poluse.cpp. This program uses a
simple lookup table which is precalculated using another simulation on the GPU by the
pol2cart program in pol2cart.cu file.
The file scanner.h defines the following basic scanner parameters:
The scanner is defined by cryNum the number of detectors per ring (set to 400),
crySize the dimensions of the square detector faces (set to 4 mm) and zNum the number
of detector rings (set to 64). In addition, the depth of the detectors in the transverse plane
cryDepth is specified as 20 mm; this parameter will be used if we include depth of interaction
in our simulation. We use the prefix cry for detector size parameters because most usually the
detectors are in fact scintillating crystals, in this case having size 4 × 4 × 20 mm.6 This is also
why we use c for the angular position of detectors around the rings.
Numerous secondary parameters can then be derived, for example the inner radius of the
system is
detRadius = cryNum*crySize/cx::pi2<>.
Note this formula assumes that the inner faces of the detectors form a perfect circle which
is an approximation for real PET systems.
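The fragment below suggests the flavour of such constexpr definitions; only the parameter values quoted in the text are used, and the explicit 2π constant stands in for the book's cx::pi2<> template.

constexpr int    cryNum    = 400;      // detectors per ring
constexpr double crySize   = 4.0;      // detector face size (mm)
constexpr int    zNum      = 64;       // number of detector rings
constexpr double cryDepth  = 20.0;     // detector depth (mm)
constexpr double twoPi     = 6.283185307179586;     // stands in for cx::pi2<>
constexpr double detRadius = cryNum*crySize/twoPi;  // inner radius of the detector ring (mm)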
In Figure 8.4 we show a possible coordinate system for voxels in the transverse plane for a
small PET system having 32 detectors per ring and a polar voxel grid of 7 rings. The grid
shown has 32 voxels per ring. Optionally, adjacent voxels in the same ring can be merged to
maintain approximately equal volumes for all voxels. In the figure the outer three rings are
shown with the full 32 voxels per ring and the inner four rings are shown with 16 voxels per
ring. The integer ring coordinate r starts at zero for the innermost ring. The angular coordinate
c is analogous to the detector number within each detector ring; it starts at zero at the top of the ring and
increases with clockwise rotation. The voxels are arranged in a 3D stack of z-slices with
typically either one or two slices per detector ring. The integer z coordinate determines
which slice the voxel belongs to. Thus overall, we use 3D cylindrical polar coordinates (r, c, z)
for the voxel grid. Conventionally the coordinate origin is on the central axis at one end of the
scanner so that z coordinate would run from 0 to 15 for a 16-ring scanner. In the examples
presented here we do not merge voxels at smaller r. This keeps our code simple and means that
the values of the integer polar coordinates (c, z) are also used to identify individual detectors in
the PET scanner. Floating point versions of these coordinates are also extensively used in the
simulations for exact positions of decay events within a voxel and interaction points within a
detector. Note our polar angle is unconventional in that it starts at zero for directions parallel to
the y-axis and then increases clockwise.
For the scanner simulated in our code we use a conventional z coordinate either running
from 0 to 63 for integer detector numbers or continuously from 0.0 to 256.0 for exact axial
positions. In the simplest simulation we use 64 axial voxel slices of thickness 4 mm to
exactly match the detector rings. Thus, the voxel dimensions are typically 2 × 2 × 4 mm in
the r, c and z directions. The c dimension is approximate and varies with ring position as the
voxels are wedge shaped not cubical. These details are summarised in Table 8.1. In real-
world clinical PET scanners, it is usual to use voxel z-dimensions of half the detector
z dimension to give a more uniform resolution in all directions. In this case the z slices
are usually arranged so that alternate slices are centred on either detector centres or detector
edges. In our case this would lead to 127 z slices where the first and last slices have widths of
3 mm and the other slices have widths of 2 mm.
The following parameters are defined in scanner.h to assist in creating and managing
data storage:
The parameter mapSlice is the number of detectors in one ring times the number of rings.
This is just the total number of detectors in the system, therefore, 400*64 = 25,600. When we
generate LORs in our simulation, we choose to accumulate the results in an array large enough
to hold all possible combinations. This requires an array of 4-byte uints of size mapSlice²,
amounting to about 2.6 GB. The array map is used in our code on the GPU while generating
events in the fullsim program. Thus, to run the examples in this chapter a GPU with a
minimum of about 4 GB of main memory is required. The simulations are done separately for
each radial voxel position, which in our case means 100 different simulations are required to generate a
complete system matrix. Were we to simply copy all the resulting map files back to host disk
space, that would require ~260 GB which would be slow and wasteful.
Some compression of the map file is possible. As explained above, the number of valid
z1-z2 combinations is given by detZdzNum which is 2080 and the number of valid
c1-c2 combinations is cryCdVNum which is 400 × 201 = 80,400. Both these numbers are
used extensively in the simulations and their product, 167,232,000, is the maximum number
of different valid LORs that can occur in the detector simulation; this is the value of Nl in
Eq. 8.1. Using this compression reduces the map file size by nearly a factor of 4, and this
method is used in our code where we refer to it as zdz format.
The symmetries in the detector mean that we can fully simulate the system using a single
voxel from each ring. In our simulation we will use the 100 voxels with integer coordinates
cv = zv = 0 and rv in the range 0 – 99. Note the ranges for cv and zv match the
arrangement of detectors in the physical scanner but the rv range is simply chosen to give a
good coverage of the useful volume of the scanner.
As explained above, lines of response are specified by four coordinates (z1,c1,z2,c2)
which define the detectors at the start and end point of the LOR. The start point of a LOR is
located at detector (z1,c1) and at the end point (z2,c2) where z1 ≤ z2. The symmet-
ries of the system mean that any two LORs with the same values of dz = z2-z1 and dc =
c2-c1 (mod 400) will have the identical values for the system matrix element for
corresponding voxels. We can think of a particular pair of (dz,dc) values as defining a
“base-LOR” (BL) from which many actual detector LORs are derived by rotation (400
positions) and translation (32 positions on average depending on the value of dz). A typical
derived LOR would be defined by (z1,c1,z1+dz,c1+dc) where the final addition is
understood to be modulo 400. Since there are 400 choices for c1 and 64-dz choices for z1,
this is on average about 400 × 32 = 12,800 detector LORs coming from each unique BL. The
number of unique BLs is 64 × 201 = 12,864 where the number of c2 values (201)
depends on our convention that 100≤dc≤300.
In the fullsim simulation program we find all LORs going through a polar voxel
having coordinates cv=0, zv=63 and rv=n where 0≤n≤99. We use an overlong scanner
with 127 rings so that the dz range of the generated LORs can reach the maximum allowed
value of 63 in the real scanner with 64 rings. These LORs have the same rotation and
translation symmetries as the base-LORs discussed above except that they also have an
associated voxel with a definite position. The associated voxel must move with the base-
LOR under any transformation. We can treat the LORs found by fullsim as “base–LORs
with voxel” (BLV) defined in Eq. 8.2,
Base LOR with Voxel: $(zsm1,\, c1,\, zsm2,\, c2)\,\{r_v = r,\; c_v = 0,\; z_v = zsm1\}$  (8.2)
where zsm1 is the displacement from the start of the LOR to the voxel and zsm2 is the
displacement from the voxel to the end of the LOR. The extent of the LOR along the z-axis is
then dz=zsm1+zsm2. The associated decay voxel coordinates are shown in the braces{}.
In the code the voxel r value is implicit because voxels with the same r value are grouped
together. The results of the fullsim program are the probabilities that a decay in the voxel
is detected in the set of base-LORs associated with that voxel. These probabilities are
invariant when the BLV is translated along the z-axis or rotated in the x-y plane. The set
of LORs derived from the BLV defined in 8.2 is shown in Eq. 8.3.
$(z,\, c1+c,\, z+dz,\, c2+c)\,\{r,\; c,\; z+zsm1\}$  (8.3)
where $dz = zsm1 + zsm2$, $0 \le z \le 63-dz$ and $0 \le c \le 399$.
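A hedged host-side sketch of Eq. 8.3: given one base-LOR-with-voxel and its radial index, the nested loops below enumerate every derived detector LOR together with its associated voxel. The function and callback names are illustrative only; the reconstruction kernels described later apply essentially this pair of loops in device code.

// assumed constants matching scanner.h
constexpr int cryNum = 400, zNum = 64;

template <typename F>
void expand_blv(int zsm1, int c1, int zsm2, int c2, int r, F use_lor)
{
    int dz = zsm1 + zsm2;
    for (int z = 0; z + dz < zNum; z++)          // translations along the scanner axis
        for (int c = 0; c < cryNum; c++)         // rotations around the ring
            use_lor(z, (c1 + c)%cryNum,          // detector 1: (z1,c1)
                    z + dz, (c2 + c)%cryNum,     // detector 2: (z2,c2)
                    r, c, z + zsm1);             // associated voxel: (r,c,z)
}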
• Lines 28–31: Here we define a region of interest (ROI) within which decay events will be generated.
The volume is specified as ranges of cylindrical polar and z coordinates. The r range is r.x to r.y
and similarly for the polar angle phi and axial distance z.
The heart of the code is the kernel voxgen which generates decays in a given scanner
voxel or other ROI and tracks the resulting back-to-back gammas to the points where they hit
the inner surface of the scanner. The gamma-pair directions are generated isotropically in 3D
and the decay positions are generated uniformly within the voxel volume. Such event
generation is fundamental to many simulations. The voxgen kernel code is shown in
Example 8.2 and a supporting kernel function ray_to_cyl in Example 8.3.
• Line 40: The templated kernel function voxgen is the heart of the fullsim program; it generates a
large number of decay events at random uniformly distributed points inside a voxel. The back-to-
back gammas from these events are projected to the surrounding PET detector rings and the hit
points are saved as a LOR. We expect to generate 10^11 or more LORs for each voxel; thus we
accumulate totals for the number of times each LOR occurs rather than just outputting individual
events.7 The arguments used by voxgen are as follows:
○ The array uint *map which is used to store the generated LORs. This is a large array of
dimensions (zNum × cryNum)² which in our case is about 2.6 GB of GPU memory. In the
code this array is addressed as the 4D array hits[z1][c1][z2][c2]. This choice is an
important design decision for our code and rules out low-end CUDA GPUs with smaller memories
but yields fast straightforward code.
The map array is used to hold integer numbers of hits; hence the choice of uint for the array
type. This allows us to accumulate counts of up to 2^32−1 for individual LORs whereas a float
type is good only for up to ~2^25. The choice of ushort was also considered to save memory but
the need to check for and deal with potential overflows in some bins would slow down and
complicate the code. Also, CUDA support of 16-bit atomicAdd is only available in the latest
GPUs (compute capability 7.0 and above).
This array is actually bigger than it needs to be; the size could be reduced by a factor of nearly
four by exploiting the constraints z1 ≤ z2 and dc ≤ 201. These constraints are used in the
MLEM reconstruction code, reco, discussed later in Section 8.5. Here we are addressing
elements of the very large map array randomly using results of a Monte Carlo simulation – this
is a worst-case scenario for CUDA and it is not clear that adding extra code to compress the array
would be worthwhile.
We also note that while performance is important for the fullsim program, it is only needed
once to generate the system matrix. Thus, it is better to spend our time optimising the later
reconstruction code which will be run many times and for clinical applications needs to be as fast
as possible.
○ The array double *ngood has size equal to the total number of threads in the grid of thread
blocks and is used to store the total number of good events found by individual threads. The type
double is used for storing individual thread’s contributions in global memory so that the final
thrust::reduce operation used to find the grand total of good hits is done with adequate
precision, bearing in mind that the total might exceed the maximum value of the uint variables
used to accumulate the contributions from individual threads.8
○ The struct Roi roi which defines the volume within which decays are to be generated. This
struct is defined in scanner.h and is shown in Example 8.1(a); it contains the float2
variables r, phi and z. The x and y components of each of these members specify the range of
values to be used in the simulation, for example, the range of radial values is roi.r.x ≤ r ≤
roi.r.y.
○ The template array S *states is used to store the cuRAND states used by each thread, as in our previous examples.
• Lines 53–57: Here we generate a random point within the region specified by roi. The points need
to populate space uniformly; this is easy to do with cubes but harder for other objects. The inverse
transform method discussed in Chapter 6 is needed to do this correctly for the radial coordinate. The
voxel r range goes from roi.r.x to roi.r.y but because of the wedge shape we need more
points for larger values of r than for smaller values of r. The formula we need is actually Eq. 6.3 in
Chapter 6 and involves the square root of a uniformly generated random number.
○ Line 53: Set phi to a uniform random value in the desired range [roi.phi.x, roi.phi.y].
○ Line 54: Set r to a random value in the desired range [roi.r.x,roi.r.y] but biased so that
large values occur more often and the resulting spatial distribution is uniform, Eq. 6.3 is used here.
○ Lines 55–56: Store the Cartesian x-y coordinates corresponding to r and phi in g.a.
○ Line 57: Store a z coordinate in g.a where z is generated uniformly in range [roi.z.x,
roi.z.y].
• Lines 58–64: Simulate the production of a pair of back-to-back gamma rays at the decay point by
generating a normalised 3D direction vector g.n. This direction is also random and we use
appropriately generated random spherical polar coordinates (θ, ϕ) for this purpose. In these
coordinates an area element on the unit sphere (or solid angle) is sin θ dθ dϕ. (A device-code sketch
of this sampling is given after this list.)
○ Line 59: Generates phi_gam uniformly in [0, 2π].
○ Line 61: For theta_gam we must again use the inverse transform method of Chapter 6 to
deal with sin θ. Since the integral of sin θ is −cos θ we need to generate u uniformly in [−1, 1] and
find cos⁻¹(u).
○ Lines 62–64: Store the Cartesian components of the direction vector in g.n.
• Line 66: Find the two points where the random Ray g, constructed by the previous steps, meets the
inner surface of the scanner. The calculation is done by calling the function ray_to_cyl with
arguments g, lor and detLongLen (the detector length in mm, 508 in our case). If the ring
difference between the hit points is found to be less than zNum (64 in our case) the function stores
the calculated (z,c) pairs in the second argument lor and returns 1 to indicate that a detected event
has been generated. Lines 67–71 are then executed for good hits.
• Line 67: Increments the local variable good to keep track of successful hits. This is important as it
will allow us to calculate the efficiency of the PET system for detecting gammas from the ratio of
good hits to all generated gamma pairs.
• Lines 68–69: Reformat the z1 and z2 LOR positions from absolute values in the long scanner to
displacements from the source voxel at z=zNum−1. This is the zsm1 and zsm2 format used in the
system matrix.
• Lines 70–71: Increment the appropriate element of the map array using atomicAdd in line 71. The
index mimics a 4D map[zsm1][c1][zsm2][c2] layout. Note what we are doing here is
effectively creating a frequency histogram of how often each possible LOR is generated for the
particular voxel geometry used. Histogramming is a worst-case scenario for GPU memory access as
all 32 threads in any warp could be accessing 32 widely different memory locations. There is no
chance of nicely coalesced memory access patterns here. On early GPU generations elaborate
schemes, involving buffering in shared memory and other tricks, might have been used to try and
improve performance. Fortunately, modern GPUs are more forgiving and we think it is best to take
the performance hit on the chin and keep the code simple.
• Lines 74–75: For each thread store the final values of the local variables good and state in the
global arrays ngood and states. Notice we use += for ngood so that values are properly
accumulated over repeated calls to the kernel.
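The device function below is a hedged sketch of the sampling steps just described: a decay point uniform within a polar-wedge ROI, followed by an isotropic direction. The Roi3 struct, function name and argument list are simplifications and not the signature of voxgen.

#include <curand_kernel.h>

struct Roi3 { float2 r, phi, z; };   // each range: .x = low edge, .y = high edge

__device__ void sample_decay(curandState &state, const Roi3 &roi,
                             float3 &a, float3 &n)
{
    // decay point: phi and z uniform, r via the inverse transform of Eq. 6.3
    float phi = roi.phi.x + (roi.phi.y - roi.phi.x)*curand_uniform(&state);
    float r   = sqrtf(roi.r.x*roi.r.x +
                (roi.r.y*roi.r.y - roi.r.x*roi.r.x)*curand_uniform(&state));
    a.x = r*sinf(phi);               // angle measured clockwise from the +y axis
    a.y = r*cosf(phi);
    a.z = roi.z.x + (roi.z.y - roi.z.x)*curand_uniform(&state);

    // isotropic direction: phi_gam uniform in [0,2pi), cos(theta) uniform in [-1,1]
    float phi_gam = 6.2831853f*curand_uniform(&state);
    float cth     = 2.0f*curand_uniform(&state) - 1.0f;
    float sth     = sqrtf(1.0f - cth*cth);
    n.x = sth*sinf(phi_gam);
    n.y = sth*cosf(phi_gam);
    n.z = cth;
}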
A listing of the function ray_to_cyl which is used by voxgen to find the points at
which a ray meets the cylindrical detector barrel is shown in Example 8.3.
92 if (z1 >= 0.0f && z1 < length && z2 >= 0.0f &&
z2 < length && abs(z2-z1) < detLen){
93 float x1 = g.a.x+g.lam1*g.n.x;
94 float y1 = g.a.y+g.lam1*g.n.y;
• Lines 83–85: Calculate the parameters A, B and C for Eq. 8.4; note the factors of 2 and 4 shown in
the equation are not present in our code because they cancel in the final answer.
• Lines 86–87: Calculate the square root (radical) used in the solution of a quadratic equation. Since
the point a is always inside the cylinder, we know the quadratic equation will always have real roots
so D must be positive and no check is needed before taking the square root. In a more general case,
where a might be outside the cylinder, a check would be necessary as negative values of D occur
when the ray misses the cylinder.
• Lines 88–91: Calculate the lam1 and lam2 values for the hit points using positive D for lam1 and
negative D for lam2. The results are stored in g. The z coordinates, z1 and z2, of the hit points are
also calculated here.
• Line 92: Checks that z1 and z2 are both inside the finite length scanner, that is in the range
[0,length] (length is 508 mm in our case) and that the difference between their values is
consistent with the length of the short physical scanner (256 mm in our case). LORs which have
a small inclination with respect to the z-axis will fail this test.
sanet.st
• Lines 93–97: Calculate the integer detector z1 and c1 values at the point given by lam1. Two small
utility functions, myatan2 and phi2cry, shown in Example 8.3, are used for this purpose.
• Lines 98–102: The same calculation for the second hit point.
• Lines 103–106: Here we swap detector elements if z1 > z2 to ensure z1 ≤ z2 in the
subsequent processing.
• Line 107: Returns 1 to indicate success.
• Line 109: Returns 0 to indicate failure, meaning that one or both gammas escape the scanner.
• Lines 121–126: This is the support function myatan2. It is based on the standard C++ atan2
math library function but is called with the arguments reversed, that is (x, y) instead of the usual
(y, x). It returns an angle following our PET convention, 0 for a vector parallel to the y axis and
increasing with clockwise rotation up to 2π. It is good programming practice to isolate special
conventions like this in a single function because then changes are easily made. If the code were
repeated in multiple places, bugs are more likely if you change your convention. A particularly nice
feature of CUDA is that we can use literally the same function on both the host and device, further
reducing the chances of bugs. (A short sketch of both helper functions is given after this list.)
• Lines 128–132: The function phi2cry converts the angle phi in radians to a detector number in
the range 0–399 following our numbering convention for the detectors as shown in Figure 8.4.
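A hedged host-only sketch of the two conventions just described; the book's versions are __host__ __device__ functions in Example 8.3 and their exact bodies may differ.

#include <cmath>

constexpr float twoPi  = 6.28318530718f;
constexpr int   cryNum = 400;            // detectors per ring (from scanner.h)

inline float myatan2(float x, float y)
{
    // arguments deliberately reversed: atan2(x,y) gives the clockwise angle
    // measured from the +y axis; map the result into [0, 2pi)
    float phi = atan2f(x, y);
    return (phi < 0.0f) ? phi + twoPi : phi;
}

inline int phi2cry(float phi)
{
    // convert an angle in radians to a detector number in [0, cryNum-1]
    int c = static_cast<int>(phi*cryNum/twoPi);
    return ((c % cryNum) + cryNum) % cryNum;
}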
The host code, which is not shown here, is straightforward and follows the usual pattern of
our examples: read parameters, allocate and initialise memory buffers, call kernels, collate
results and finish up. To compare the performance of our GPU code to similar CPU code we
can replace the kernel voxgen with some similar host code running on a single CPU thread.
Some typical results are shown in the box.
The GPU version of fullsim runs on an RTX 2070 GPU and is capable of generating
about 5 × 10^9 events per second with isotropic orientations. About half of these interact with
the PET detectors for our specific geometry. This information itself is interesting as it allows
the user to calculate the geometrical efficiency of the PET scanner as a function of voxel
position. The detailed information in the output file will let us go further and create the full
system matrix. The same code running on a single CPU of the host PC is well over
1000 times slower. This is in spite of the fact that analysis of the GPU performance with
Nsight Compute suggests the GPU code is limited by memory access and delivering “only”
about 450 GFlops.
The entire system matrix requires 100 such generations corresponding to an overall
generation time of only a few minutes of GPU time for reasonable statistical accuracy.
Note again that the use of polar voxels is critical here. An equivalent 200 × 200 Cartesian
grid has 8-fold symmetry and would require about 5000 separate runs of fullsim, a
50-fold increase in both computing time and disk space for results.
If we were to write the full 2.6 GB output map array to disk, the disk IO would take
significantly longer than the time required to generate its contents. The program does have an
option to do this because these files are useful for debugging. For each fixed (c1,zsm1)
pair one slice of the map is a 2D plot of all the associated (c2,zsm2) hits. Now all these
LORs must pass through both the 4 × 4 mm area of the first detector and the fixed source
voxel of approximately 2 mm³, and this restricts their possible directions to a narrow cone
diverging from the source voxel towards the second detector. The divergence of these cones
Figure 8.5 PET detector spot maps for second gamma from LOR
increases as the distance between the source voxel and the first detector decreases. Two
examples corresponding to the extremes of spot sizes are shown in Figure 8.5.
Figure 8.5 shows spot maps with (c2,zsm2) distributions for fixed (c1,zsm1)=(6,5). The
zsm2 axis is vertical and the c2 axis is horizontal. In 8.5 (a) the source voxel is near the
scanner surface with 198 < r < 200 mm and 0 < ϕ < 7.5; here the spot is large with 278 non-zero
values centred at (c2,zsm2)=(247,35). The peak value is 95518 for 10^10 generated decays. In
8.5 (b) the source voxel is near the scanner centre with 0 < r < 2 mm and 0 < ϕ < 7.5; this time
the spot is small with six non-zero values centred at (c2,zsm2)=(206,5). The peak value is
182121 for 10^10 generated decays. The “lever arm”, zsm2/zsm1, is equal to 7 for (a) and
1 for (b) which accounts for the difference in spot sizes.
Apart from debugging, it is unnecessary to output the entire 400 × 64 (c2,zsm2) array
for each fixed (c1,zsm1) pair. In fact, a square of dimension 24 × 24 is big enough to
accommodate each “spot”. It is easy to calculate the centre of the spots using the known
positions of (zsm1,c1) and the source voxel. The calculation is similar to that of Eq. 8.4
except that as it is known that one of the points on the desired ray is already on the surface of
the scanner, we only need to solve a linear equation; the result is shown in Eq. 8.5 in the box.
(8.5)
In order to compress the output files, a kernel, find_spot, was written which reduces the
volume of data that needs to be transferred back to the host and written to disk from 2.6 GB
to 59 MB. The kernel is shown in Example 8.4. This kernel is not particularly elegant nor
particularly efficient; rather it demonstrates that, once one knows a bit of CUDA, it is
relatively easy to write kernels to do specific programming tasks for which a bespoke library
function is unlikely to exist. Since the kernel has to process 400 × 64 separate frames, each also
of size 400 × 64, we allocate one CUDA thread per frame by using 64 blocks of 400 threads.
Having made this design choice no further tuning of launch block sizes is possible.
○ Line 138: This performs the linear interpolation c = a + λ(b − a) using the utility function lerp
defined in the CUDA file helper_math.h.
○ Lines 139–141: Here we calculate the integer detector coordinates (z2,c2) for the mid-point of
the “spot”.
• Lines 163–165: Convert the z1 and z2 measured from LH end of scanner to the source voxel
displacements zsm1 and zsm2 used in indexing the map and spot arrays. Note we clamp zsm2 to
the allowed range [0,zNum-1] and continue. This is because although zsm2 might be out of the
valid range for the scanner, some of the halo LORs in the associated spot could still be detectable.
• Lines 166–167: Find indices to the relevant (zsm1,c1) slice in map and spot for the
current thread.
The rest of the code concerns copying the correct tile from map to spot.
• Lines 168–178: Loop over the 24 × 24 tile centred at the calculated hit point (zsm2,c2) in map and
copy values to spot. For index calculations involving the angular c coordinate we use modular
arithmetic implemented by the utility functions cyc_inc and cyc_sub.
• Lines 180–181: Finally we store the position of the tile in the first two elements of the first column of
the tile in spot. There is a tacit assumption here that the tile is big enough to prevent any useful
data being overwritten. Obviously a more robust strategy could be used, for example, writing these
numbers to a separate file, but here we want to keep our code as simple as possible.
• Lines 190–201: A set of small support functions to perform modular addition and subtraction with
respect to cryNum (sketched below). Since cryNum has been defined as constexpr in scanner.h its value is
known at compile time, aiding the compilation of efficient inline functions.
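A hedged guess at what such helpers might look like; the actual signatures in the repository may differ, for example they might take a step argument rather than moving by one position.

constexpr int cryNum = 400;
__host__ __device__ inline int cyc_inc(int c) { return (c+1 < cryNum) ? c+1 : 0; }   // (c+1) mod cryNum
__host__ __device__ inline int cyc_sub(int c) { return (c > 0) ? c-1 : cryNum-1; }   // (c-1) mod cryNum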
The host code used to call these kernels is not fully discussed here, but it is straightforward
and can be found in our repository. Basically, the user can specify which subset of voxels to
simulate and the dimensions of each voxel. Most of the parameters used by the simulation
are actually defined in the scanner.h file and changes to these parameters would require
recompilation of the whole program.
Each voxel simulated is saved in a separate binary file comprising a stack of 400 × 64 tiles,
each of size 24 × 24. These files are named spot_map000.raw to spot_map099.raw for the
full set of 100 voxels at different radii. On my system using 10^11 generations per voxel (a lot!)
all the files can be generated in about 36 minutes. Using an older GTX 970 GPU is about 2.5
times slower, needing about 90 minutes. Interestingly our results for the RTX 2070 card are
about 160 times faster than results I was able to achieve using a cluster of 32 dual processor
Xeon 3.2 GHz PCs for a similar project in 2007.9 That system was not cheap to buy or run; it
was noisy, required a full-time system manager and lots of air conditioning.
The full data set is still quite large at just under 6 GB. However, there are still a substantial
number of zeros (most spots are much smaller than 24 × 24) or tiny values in the files. Since
these can be discarded when building a useful system matrix, this size can be further reduced.
for the voxel with c=0, z=zsm1 and r=rs where rs in [0,99] depends on which of the 100
spot-map files is used. The integer values in the spot-map files need to be divided by the total
number of decays, Ngen used in the simulation to obtain the actual probability. As an implemen-
tation detail, in our code the scaling by Ngen is performed by the reconstruction code which reads
unscaled values stored in the system matrix.10
Combining the 100 spot-map files into a single system matrix is relatively quick and
straightforward; we simply concatenate all the spot-map files into a single file and compress
the data to minimise storage requirements. We compress the four integer values defining a
base LOR into a single 32-bit word using seven bits for the z values and nine bits for the
c values as previously illustrated in Figure 8.3.
In the code we refer to the compressed values as “keys” and for each spot-map file the
keys are sorted into ascending key value order when stored in the system matrix file. This is
an important detail.
The actual system matrix file is a stack of smPart objects which hold the key values and
associated number of decays converted to a 32-bit float.11 The definition is shown in
Example 8.5 together with functions for packing and unpacking keys.
Example 8.5 smPart object with key2lor and lor2key utility functions
struct smPart {
   uint key;
   float val;
};

// header reconstructed: the function signature was lost in this extract;
// smLor is assumed to be a struct holding the four base-LOR fields
// zsm1, c1, zsm2 and c2
__host__ __device__ uint lor2key(const smLor &sml)
{
   uint key = 0;
   key = (sml.zsm1<<25) | (sml.c1<<16) | (sml.zsm2<<9)
       | (sml.c2);
   return key;
}
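The caption of Example 8.5 also mentions key2lor, which is not reproduced above. A sketch of what the unpacking might look like is given below; the container struct smLor and its field types are assumptions made for illustration, and the real definitions are in the repository.

struct smLor { int zsm1, c1, zsm2, c2; };  // assumed holder for a base LOR

__host__ __device__ inline smLor key2lor(uint key)
{
   smLor sml;
   sml.zsm1 = (key >> 25) & 0x7f;   // 7-bit z of first crystal
   sml.c1   = (key >> 16) & 0x1ff;  // 9-bit c of first crystal
   sml.zsm2 = (key >>  9) & 0x7f;   // 7-bit z of second crystal
   sml.c2   =  key        & 0x1ff;  // 9-bit c of second crystal
   return sml;
}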
A straightforward host code program readspot is used to convert the 100 spot-map files
into a single system-matrix file; the code is available in our repository. The outputs from this
program are two files. The first file, sysmat.raw, is the system matrix itself and it contains
all the smParts from valid lines of response in the spot-map files. Importantly, the entries
are sorted first in ascending order of the radial voxel position and then the data for each radial
subset are sorted into ascending key order. The sorting by key affects memory access
patterns in the reconstruction code. Using sorted keys has the consequence that zsm1 and
c1 vary more slowly than zsm2 and c2 when stepping through sysmat.raw. The kernel code
is written with the awareness that changes in zsm1 or c1 correspond to large strides through
the arrays whereas changes in zsm2 or c2 correspond to smaller strides; thus threads in the
same warp should tend to use the same values of zsm1 and c1. This improves the memory
caching performance when running the kernels.
The readspot program can reject some of the LOR data in the spot-map files using user
settable cuts. In particular, the user can specify a maximum value for zsm1+zsm2, that is
the maximum scanner-ring difference for the LORs. This is useful in real-world scanners
where LORs with large ring differences give less precise information. The other cut is a
probability value cut on the LORs – we have found that there is a large tail of very small
values which contribute nothing to the accuracy of reconstruction. For each zsm2-c2 spot
we find the maximum value and then reject LORs with values less than a fixed fraction of
that value. For example, a value cut of 3% reduces the size of the sysmat file by 34%. After
such cuts the sysmat file holds about 6 × 10^6 base-LORs requiring just 48 MB of memory.
The second file produced by the readspot program is systab.raw, a small index
file containing pointers to the start and end of data for each radial subset in the sysmat file.
The systab file contains one smTab object per radial subset; the smTab structure is
shown in Example 8.6. The integer start points to the first entry in sysmat for this ring
and end is one greater than the final index for this ring. The other values are for possible
development and are not used at present, but they do pad the size of smTab to 16 bytes which
matches GPU L1 cache lines.
Example 8.6 smTab structure used for indexing the system matrix
struct smTab {
int ring; // voxel radial position
uint start; // pointer to sysmat start of data
uint end; // pointer to sysmat end of data
int phi_steps;
};
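A compile-time check of the 16-byte padding mentioned above could be added when experimenting with this structure; the one-liner below is a suggestion, not part of the book's listing (uint is assumed to be a 32-bit typedef, as used throughout the code base).

static_assert(sizeof(smTab) == 16, "smTab expected to occupy 16 bytes");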
activity in LOR l. This term is the backward projection of the detected counts in LORs to
activity values in voxels. The summation is performed by the kernel backward_project
shown in Example 8.8 which also uses an outer loop over l values.
3. The final step involves multiplication of the current activity estimate $a_v^n$ by
$BP_v^n \big/ \sum_{l'=1}^{N_l} S_{vl'}$, where the denominator is a normalising constant depending only on v and needs only to
be calculated once. This step is performed by the kernel rescale, also shown in
Example 8.8.
Note both projection kernels use thread block sizes reflecting the angular symmetry of our
polar voxel design – 400 in this case. This means that each sysmat element is only processed
by one thread block which handles the phi symmetry by using one thread per phi value and
these threads then loop over the possible z displacements. With this design the kernel code
required to implement Eq. 8.1 is straightforward and is shown in Examples 8.7–8.9.
• Lines 1–3: The support function c2_to_dc2 calculates a dc index from c1 and c2. The difference
c2-c1 is calculated modulo 400 going clockwise from c1 to c2 and then subtracting 100 to get an
index in the range [0,200].
• Lines 4–6: The support function zdz_slice returns the starting index for a given dz value
in the compressed LOR file format.
• Line 10: The declaration of the forward_project kernel; its arguments are as follows:
○ sm – this is the system matrix passed as a device pointer to an array of smPart objects. In our
scheme each kernel call processes the sm elements belonging to a single voxel ring.
○ a – a pointer to the float array holding the current estimate of the voxel activities $a_v^n$. The voxel array
size is radNum × phiNum × zNum = 100 × 400 × 64 = 2.56 × 10^6.
○ ring – an int value of the voxel ring to be processed on this call; it is needed when indexing the
voxel array a and its value is implicit in the range of sm elements used.
○ FP – an output array that holds the forward projected values, that is, the expected counts in detected
LORs due to the current activities in a. This array is initialised to zeros by the host and gets
incremented on each subsequent call for different rings.
○ dzcut and valcut – optional user-defined cut values used to skip some sysmat values for faster processing.
• Lines 24–29: Here we loop over all the valid z positions for the translationally invariant
sm element. In the for loop the variable zs1 represents the position of the left-hand end of
the LOR and has the range [0,63-dz]. The corresponding positions of the voxel are
[tl.zsm1,63-tl.zsm2].
• Lines 25–26: Here we finally do the required calculation, adding $a_v S_{vl}$ to $FP_l$ using atomicAdd to
perform the addition. The atomicAdd is necessary because many voxels contribute to each LOR.
Experiments suggest that on modern GPUs this use of atomicAdd does not have a strong effect
on performance.
• Lines 27–28: Here we increment the array pointers appropriately for the LOR and voxel z positions.
• Line 30: In the last line of the kernel we increment the smpos index by the number of thread blocks
to prepare for the next iteration.
In this kernel all the real work is done in lines 25–26 inside the for loop over zsm1; all the rest of the
code is a wrapper designed to loop over all the entries in the system matrix.
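To make the pattern concrete, the following is a minimal, self-contained sketch of the inner loop just described. All names, the indexing scheme and the function wrapper are illustrative assumptions; the real forward_project kernel in the repository is more elaborate (it also applies the dzcut and valcut cuts).

__device__ void fp_inner_loop(const float *a,      // voxel activities a_v
                              float *FP,           // forward projected LORs
                              float sval,          // S_vl for this sm element
                              int zsteps,          // number of valid z positions
                              int v0, int vstride, // first voxel index and its z stride
                              int l0, int lstride) // first LOR index and its z stride
{
    for (int zs1 = 0; zs1 < zsteps; zs1++) {
        atomicAdd(&FP[l0], a[v0]*sval);  // many voxels contribute to each LOR
        v0 += vstride;                   // advance voxel z position
        l0 += lstride;                   // advance LOR z position
    }
}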
In Example 8.8, we show the code for the backward_project kernel which uses the
same wrapper code as the forward_project kernel but calculates BP as the second step
of the MLEM iteration formula.
○ Line 56: In this statement we guard against division by zero in cases where an element of FP is zero,
by setting the local copy FPdiv to a minimum value of 1. This heuristic is based on the observation
that valid values of FP are always large.
○ Line 57: Sets the local variable element as the required contribution to BP.
The check in line 56 is necessary because on the GPU, division by zero silently sets the result to
nan and in our iterative calculation any nan elements in BP will then propagate throughout the
calculation. Zeros in FP can arise because some of the measured values in the array meas used in
line 57 can legitimately be zero.
• Lines 70–77: The final rescale kernel needed to calculate Eq. 8.1 is shown here. This kernel
simply scales each element $a_v$ by the factor $BP_v \big/ \sum_l S_{vl}$. The values of the denominator are
calculated once on the host and passed to the rescale kernel in the kernel argument norm.
8.6 Results
We have tested our code with simulated decays from a Derenzo rod phantom; these
phantoms are commonly used for studying PET reconstruction and consist of a number of
cylinders of varying diameters grouped into a hexagonal pattern as shown in Figure 8.6.
Figure 8.6 shows the cylinder pattern of the generated Derenzo phantom; the cylinder
diameters are 24.5, 17.1, 9.8, 8.6 and 6.1 mm and their lengths are 128 mm. The layout in the
transverse plane is shown in (a) and a 3D rendering is shown in (b). The LORs from
simulated decays in the phantom were generated using a modified version of the fullsim
program; the activity per unit volume was assumed to be constant everywhere at about
2.5 × 10^5 decays per mm³, amounting to a total activity of about 2.3 × 10^10 decays. The
number of detected decays in the list-mode data set used for reconstruction was about
8.25 × 10^9, corresponding to a geometric detection efficiency of 36% for this object. Over
half the LORs in the data set had zero detected counts. Figure 8.6 (c) is a histogram of the
numbers of detected decays in each LOR; about 60% of the possible LORs have zero counts;
the maximum number of counts in a single LOR was 504. The grey distribution corresponds
to a log-scale.
Figure 8.6 Derenzo Phantom transverse and 3D views and generated counts per LOR
Figure 8.7 MLEM iteration time as a function of the number of thread blocks
Our initial testing showed that while the code worked and gave potentially excellent
reconstructions the time required was disappointingly large. For a launch configuration of
1024 thread blocks with 400 threads per block, each iteration required 12.7 seconds. About
50–100 iterations taking 10–20 minutes were required to give high-quality results. Although
this was about 40 times faster than the equivalent code running on a single CPU, the GPU
was running at well under its potential – the code was clearly limited by memory access.
After spending some time experimenting, unsuccessfully, with different memory organisa-
tions we tried optimising the number of thread blocks while keeping the number of threads
per block fixed at the design value of 400. It turned out that our standard assumption, that
1024 blocks was about right for most problems, is wildly wrong for this problem; the
optimal number is around 156,000! Figure 8.7 shows how the execution time per iteration
varies as a function of the number of thread blocks. Thus, simply using the optimal number of blocks reduces
the time per iteration to about 3.9 seconds, a speed-up of a factor of 3.3.
We can estimate the performance of the code by noting that the inner loops of the forward
and backward projection kernels contain about 20 arithmetic instructions including one
division and two atomic adds. These inner loops over zs1 are executed an average of 32 times
per system matrix element. For this test we used a system matrix with about 12.8 × 10^6
elements and each element is used by 400 threads. Thus, the total number of arithmetic
operations performed in one iteration is about 20 × 400 × 32 × 12.8 × 10^6 ≈ 3.3 × 10^12, or
just under 10^12 operations per second, which is reasonable for a memory intensive problem.
This is an excellent example of the GPU hiding memory latency when it has enough
threads. This result is for an RTX 2070 GPU; other architectures may have different optima.
For comparison, running the same code on a Maxwell generation GTX 970 GPU gave an
optimal execution time of 23.5 seconds per iteration, which is eight times slower. This card
also required a large number of thread blocks for best performance. Here we found about
50,000 thread blocks gave the best results but the improvement over using 1024 thread
blocks was only a factor of 1.2 or 20% compared to the 330% speed-up obtained with the
RTX 2070. This is an unexpected and dramatic difference between cards that are only two
generations apart.
$$ a_v^{n+1} \;=\; \frac{a_v^n}{\displaystyle\sum_{l'\in\mathrm{Set}\,b}^{N_l} S_{vl'}} \;\sum_{l\in\mathrm{Set}\,b}^{N_l} \frac{m_l\, S_{vl}}{\displaystyle\sum_{v'=1}^{N_v} a_{v'}^{n}\, S_{v'l}} \qquad \text{for subsets } b = 1, 2, \ldots, B. \tag{8.6} $$
In this equation it is understood that in successive iterations the subsets are processed in
order so that after B iterations all the measured LOR data has been used. We will refer to the
result of B such iterations as a full OSEM-B iteration. The idea is that one full OSEM-B
iteration should take the same time as a single MLEM iteration but converge a factor of
B times faster. In practice, while OSEM works well for early iterations it may not ultimately
converge to quite the same final state as MLEM. The number of subsets, B, is known as the
OSEM factor and relatively high values such as 16 are widely used in the PET community.
OSEM can be easily implemented in our code, with no changes to the kernels, by
appropriate partitioning of system matrix elements into subsets and using the host code to
set the kernel arguments smstart and smend to span one subset on each iteration.
Similarly, the norm pointer used by the rescale kernel is set by the host to the appropriate
subset sum on each iteration. However, our method, which exploits rotational symmetry,
means that our definition of the OSEM subsets requires care. For the OSEM method to work
it is important that each subset of LORs covers the entire region of interest in a fairly uniform
way; normally this is done using parallel projections which are sets of LORs parallel to each
other in the transverse x-y plane. In standard PET image reconstruction, each such parallel
projection can be used as an OSEM subset.
In our approach, however, each thread in a thread block processes the same LOR but
rotated to 400 different angles in the x-y plane. Moreover, each physical LOR occurs in
potentially many system matrix elements with differing voxel radii or c1 offsets. Helpfully,
in our implementation the value of c1+c2 obtained from the system matrix key defines the
angle of a LOR in the transverse plane and this is different for all threads in a thread block
which add an integer between 0 and 399 to both c1 and c2. However, all threads in a given
thread block will agree on whether c1+c2 is even or odd – we can use this to divide our
system matrix into two mutually exclusive subsets for B=2 OSEM processing. To go further
we can use the even or oddness of c1 itself to subdivide both of these subsets for B=4
OSEM or the value of c1%m for B=2m subsets. Using c1 in this way ensures that all
instances of a particular physical LOR in our system matrix are mapped to the same subset
but does not ensure complete coverage of the region of interest, particularly for larger
B values.
Figure 8.8 PET reconstruction results for MLEM and OSEM with an RTX 2070 GPU
OSEM-1 is clearly identical to MLEM and, in practice, we find B=2, 4 and 8
gives good convergence, but higher B values do not. The execution times for full OSEM
iterations are 3.8, 4.6 and 5.9 seconds for B =2, 4 and 8 respectively. Some results for the
Derenzo phantom are shown in Figures 8.8 and 8.9. For OSEM factors above 8 we find that
reconstructions start to lose quality and the time per full iteration increases.
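As an illustration of the subset rule described above, the mapping from a system matrix key to a subset index could be expressed as follows. This is only a sketch using the key layout of Example 8.5; the book's host code instead partitions the system matrix itself and passes smstart and smend to the kernels.

// returns a subset index in [0, 2m-1] for B = 2m OSEM subsets
int subset_of(unsigned int key, int m)
{
    int c1 = (key >> 16) & 0x1ff;          // 9-bit c1 field of the key
    int c2 =  key        & 0x1ff;          // 9-bit c2 field of the key
    return 2*(c1 % m) + ((c1 + c2) & 1);   // parity of c1+c2, refined by c1 % m
}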
The fact that good quantitative and excellent qualitative image data can be reconstructed
in less than one minute using iterative methods with OSEM factors up to 8 is very helpful in
some clinical applications. The fact that this can be done using inexpensive commodity
GPUs compatible with a clinical environment is also significant.
Figure 8.8 shows four rows of images for a central slice of the Derenzo phantom. The top row
shows the original activity distribution, the second row shows a reconstruction with 5 MLEM
iterations (15.8 secs), the third row shows a reconstruction with 5 OSEM iterations using B=8
subsets (24.3 secs) and the final row shows the reconstruction after 100 MLEM iterations
(312.5 secs). The four images in each row are firstly a slice in the x-y plane, secondly a slice
in the x-z plane, thirdly a slice in the y-z plane and finally a profile of counts along a
horizontal line in the x direction. The y-position of the x-z slices in column two and the
profiles in column four is indicated by the horizontal arrows shown in the top left image. The
x-position of the y-z slices shown in column three is shown by the vertical arrows in the top
left image.
All the host code required to run the kernels described in this section can be found in our
code repository. So far we have only simulated an idealised PET gamma detector; for more
accurate results it is necessary to do more sophisticated simulations as discussed in the next
section. Interestingly, while these complications affect the size of the system matrix and the
time taken to generate it, the reconstruction code shown here does not need changing.
λattn is a property of the medium called the attenuation length. Many clinical PET systems
use BGO (bismuth germanate) for their detectors; BGO has an attenuation length of ~10.1 mm.
A gamma ray entering perpendicularly into a 20 mm deep layer of this material has an 86%
chance of interacting. For PET both gammas need to be detected, giving a joint probability of 74%.
This is a reasonable value for clinical PET systems. In practice most gammas enter the material at
an oblique angle (as in Figure 8.9) and hence have a greater chance of detection, albeit at the price of
possible DOI errors if a system matrix based on entry points is used.
We can straightforwardly modify our fullsim simulation to include DOI effects
resulting in a physically accurate system matrix which is then used by the unmodified
reconstruction code. The detectors used for PET use arrays of dense scintillating material
to absorb gammas and then emit light which is detected by photomultipliers. Two commonly
used scintillators for PET are LSO and BGO which have attenuation lengths of 11.4 and
10.4 mm respectively. These values are defined in scanner.h as LSO_atlen and
BGO_atlen. The probability distribution function for a gamma to interact after traveling
a distance x through the material is:
$$ P(x) = \frac{1}{\lambda_{\mathrm{attn}}}\, e^{-x/\lambda_{\mathrm{attn}}}, \tag{8.7} $$
and the probability that the gamma interacts before exceeding the distance x is
$$ f(x) = \int_0^x \frac{1}{\lambda_{\mathrm{attn}}}\, e^{-x'/\lambda_{\mathrm{attn}}}\, \mathrm{d}x' \;=\; 1 - e^{-x/\lambda_{\mathrm{attn}}}. \tag{8.8} $$
Applying our inverse transform method from Chapter 6 to this function gives the following
formula for generating values distributed like f(x):
$$ x = -\lambda_{\mathrm{attn}} \ln(1-U), \tag{8.9} $$
where U is a uniform random number in [0,1].
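A device function implementing Eq. 8.9 with cuRAND might look like the sketch below. The function name is illustrative; curand_uniform returns a value in (0,1], which has the same distribution as 1−U and conveniently avoids taking the log of zero.

#include <curand_kernel.h>

__device__ float exp_path(curandState &state, float lambda_attn)
{
    float u = curand_uniform(&state);   // uniform in (0,1]
    return -lambda_attn*logf(u);        // exponentially distributed path length
}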
To simulate depth of interaction we model the PET detector array as a cylindrical shell of
material with inner radius detRadius and outer radius detRadius + cryDepth
where cryDepth is just the thickness of the material in the radial direction. These constants
are defined in scanner.h, where the additional constant doiR2 is defined as the square of
the outer radius.
Example 8.9 shows the modifications to the voxgen and ray_to_cyl kernels from
Example 8.1 needed to include DOI.
• Line 80: Notice there is an extra argument – the cuRAND state object &state. The argument is
passed by reference because it is important that the caller receives the updated version of state
after the function call.
• Lines 88.1–88.5: These have been inserted to add an interaction point sampled from an exponential
probability distribution along the first gamma’s path inside the detector material and check if it
escapes before interacting.
○ Line 88.1: Generates a random step from an exponential distribution with mean BGO_atlen. The
two gammas share the same direction vector g.n; therefore, one of g.lam1 and g.lam2 will be
positive and the other negative. The conditional assignment here caters for either possibility.
○ Lines 88.3–4: Calculate the x and y coordinates, x1 and y1, of the candidate gamma interaction point
and check whether it lies within the outer radius of the detector. The square of this radius is stored
in the constexpr doiR2 to avoid having to take a square root here. If the point is outside the detector
the gamma has escaped and we return the failure flag.
• Lines 90.1–90.5: These perform the same interaction point check for the second gamma.
• Lines 91–110: These are the same as Example 8.3. Note the z1 and z2 coordinates now also depend
on the generated interaction points.
Our tracking of gammas through material using samples from an exponential distribution
is typical of the code used in many simulations of radiation transport problems. An important
feature of our kernel code is that the only thread divergence occurs when a thread exits
because its gamma misses the detector or exits through the outer surface of the detector ring
without interacting. However, we have still simplified the geometry and the next sections
present some ideas for the simulation of more complex cases. As the geometry becomes
more complex, gammas have more opportunity to follow different paths and preventing
thread divergence becomes harder. The code presented in the next section is just one
approach, we do not claim it is necessarily the best.
For a full block simulation, we have to trace the paths of individual gammas first to a
particular block and then through that block. The latter is actually a standard problem for all
ray tracing code: the basic geometry is shown in Figure 8.11.
The block is aligned parallel to the coordinate axes and is defined by its corner points p
and q. The normals at the faces point outwards and are labelled ux, uy and uz for the faces at
p and vx, vy and vz for the faces at q. A typical ray is shown entering at the lower face and
leaving at the upper face.
A straight line or ray in 3D space is defined by the equation $r = a + \lambda n$, where r is the
position vector of a point on the ray, a is a fixed point on the ray, n is the direction vector of
the ray and λ is a parameter measuring the distance of r from a. If n is a unit vector then λ
measures actual distance, otherwise the distance is $\lambda |n|$.
A plane in 3D space is defined by the equation $r \cdot m = d$, where r is the position vector of a
point on the plane, m is the normal to the plane and d is the perpendicular distance of the
plane from the origin.
To find the points where a LOR defined by the ray $r = a + \lambda n$ meets the planes in
Figure 8.11, we need to find values of λ that are solutions of the equations $r \cdot m = p \cdot m$ or
$r \cdot m = q \cdot m$, where p and q are the block corners indicated in Figure 8.11 and m is the
normal to one of the planes touching that corner. We can solve these equations by noting that
for the common point:

$$ (a + \lambda_p n)\cdot m = p \cdot m \;\Rightarrow\; \lambda_p = \frac{(p-a)\cdot m}{n \cdot m} \qquad\text{and}\qquad (a + \lambda_q n)\cdot m = q \cdot m \;\Rightarrow\; \lambda_q = \frac{(q-a)\cdot m}{n \cdot m}. \tag{8.10} $$
In our case m represents one of the $u_i$ or $v_i$ normals to the faces of the block; these normals
are all parallel to the coordinate axes, thus only one component of the vector m is non-zero.
Hence, we can express the six solutions as

$$ \lambda_{p_i} = (p_i - a_i)/n_i \qquad\text{and}\qquad \lambda_{q_i} = (q_i - a_i)/n_i \qquad\text{for } i = x, y, z. \tag{8.11} $$

Notice the values of the non-zero components of $u_i$ and $v_i$ have cancelled, so that the
directions chosen for these vectors (namely in or out of the block) did not matter. If $n_i$ is
zero, then the ray is parallel to the plane in question and there is no solution.
We can exploit the fact that the overloaded operators defined for the CUDA float3 data
type perform operations component-wise. Thus, if a ray is defined by the float3 vectors a
and n and a coordinate-aligned block has corners defined by the float3 vectors p and q, then the
expressions for λ in Eq. 8.11 can be evaluated in just two lines of code as shown in the box.
. . .
float3 lam_p = (p-a)/n; // λp components for the 3 p faces
float3 lam_q = (q-a)/n; // λq components for the 3 q faces
. . .
Fragment showing Eq. 8.11 implemented in CUDA kernel code.
Unfortunately finding the 6 λ values is not enough; we also have to find which two points
on the ray are actually on a surface of the physical block and not on an extension of a plane
beyond the block. This means finding all six space points and checking that their coordinates
are in the right range.
Actually, the problem of finding if and where a ray meets a rectangular block is funda-
mental in ray-tracing applications. A block with its sides aligned with the coordinate axes as
in Figure 8.11 is often called an Axes Aligned Bounding Block (AABB) and such blocks are
used to build hierarchical trees of objects contained in the scene being rendered. It is easy to
show that if Eq. 8.11 is solved for a given ray and block then that ray will miss the block if
and only if all the components of lam_p are either greater than or smaller than all the
components of lam_q; otherwise the ray intersects the block. This test can be performed
using max and min functions. It is useful in situations where many rays miss the block.
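For reference, a common formulation of this min/max test is sketched below; it is not used in the book's kernels for the reason given in the next paragraph, and lam_p and lam_q are assumed to hold the six λ values of Eq. 8.11.

__device__ bool ray_misses_block(float3 lam_p, float3 lam_q)
{
    // largest entry value and smallest exit value over the three slabs
    float t_near = fmaxf(fmaxf(fminf(lam_p.x, lam_q.x),
                               fminf(lam_p.y, lam_q.y)),
                               fminf(lam_p.z, lam_q.z));
    float t_far  = fminf(fminf(fmaxf(lam_p.x, lam_q.x),
                               fmaxf(lam_p.y, lam_q.y)),
                               fmaxf(lam_p.z, lam_q.z));
    return t_near > t_far;   // the slab intervals do not overlap
}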
For our GPU code it is not worth including this test because for each 32-thread warp it is
very likely that some of the threads will have gammas that do intersect with their target block
and those threads would simply be slowed down by such a test – thus the overall simulation
time would be increased even though some threads were doing less work overall. This is an
example of how a different mindset is needed for GPU program optimisation as opposed to
single thread CPU optimisation.
The rest of our calculation is not as elegant as the above fragment – from the six calculated
values of λ we have to find the two values λin and λout that correspond to the required entry and
exit points. This involves checking all six λ values in turn. Either no candidates will be found –
in which case the gamma misses the block – or two candidates will be found. In our simulation
the ray direction vector n is such that λ increases from zero as we move from the decay point a;
thus the smaller of the two candidates will be the entry point λin and the larger the exit point
λout. In addition, the difference between λout and λin is the path length travelled through the
block and this can be used for DOI probability checks. Our code is shown in Example 8.10.
equation $b = a + \lambda_{px} n$.
○ Line 19: Check that $p_y \le b_y \le q_y$ and $p_z \le b_z \le q_z$, that is, that the point is on the left-side face of
the block.
○ Line 20: If the point is on the face then update lmin using fminf.13
○ Line 21: If the point is on the face also update lmax and exit_plane; this requires an if statement.
This function has a lot of if statements but there are no else clauses; hence there is a lot
of conditional execution going on resulting in idle periods for many threads but at least there
is no genuine thread divergence in the code. Minimising genuine thread divergence is a key
design aim of our simulation code. The code contains 6 evaluations of the form:
float3 b = g.a + lam.x * g.n;
Each of the three components of b requires one floating point multiply and one addition
making a total of 36 floating point instructions. However, in each case we use only two of the
three components of b, so 12 instructions are wasted. This suggests it should be more
efficient to abandon the elegant vector expressions using overloaded operators to evaluate
all three components and instead explicitly evaluate just the required components; this would
involve replacing line 18 above with:
float3 b;
b.y = g.a.y + lam.x * g.n.y;
b.z = g.a.z + lam.x * g.n.z;
and similarly, for the other 5 evaluations of b. This should reduce the instruction count to
24 floating point operations which is significant. However, before making this change, it is
interesting to look at the ptx code generated by the CUDA NVCC compiler for Example
8.10. The generated code for line 23 of Example 8.10 is shown in the following box:
We see that just two machine code instructions are generated; each of these instructions is
a fused multiply and add taking three inputs (g.a.y/z, lam_p.x and g.n.y/z) and
storing one output (b.y/z). This is the best possible code that can be generated; the unused
component of b is NOT calculated and moreover a single fused multiply and add is being
used to combine the addition and multiply into one instruction. There is no need to change
our code from the elegant vector equations used and only a total of 12 instructions are
generated, not the 36 we feared and actually better than the 24 we expected after our
proposed optimisation. In this case the compiler is at least as clever as we are and, as is so
often the case, straightforward code is best.
The code in Example 8.10 is still verbose with a lot of error-prone repetition. Example
8.11 is the same as 8.10 except that lines 18–47 are replaced by the lambda function lam_check,
which is defined in lines 21–38 and then called twice in lines 40 and 41.
Example 8.12 shows the kernel function track_ray which calls the ray_to_block2
function and plays the same role as the ray_to_cyl kernel shown in Example 8.3 and
ray_to_cyl_doi in Example 8.9.
○ Line 87: If we have actually passed through a block the value of path is adjusted down
appropriately as it is the total distance travelled through both blocks that will determine if this
gamma will be detected.
○ Lines 88 and 89: Here we use the sign of rg.n.x to determine if the gamma is travelling towards
the left or right-hand side of the AABB. If the gamma is travelling to the left, we
remake rg to use the block one to the right of the original so that the gamma can enter its right-
hand side, and vice versa for a gamma travelling to the right of the original block. This is
slightly counterintuitive.
○ Lines 90–96: This is simply a repetition of lines 79–85 to see if the gamma interacts in the new block.
Table 8.2

Program version    Fraction of good events in simulation (%)      Typical good events/sec
                   Ring 0       Ring 70      Ring 99
fullsim            46.9         48.1         52.7                  2.3 × 10^9
fullsimdoi         32.8         36.9         42.8                  1.8 × 10^9
fullblock          30.6         35.3         41.1                  1.8 × 10^9

Example 8.13 voxgen_block kernel for event generation in blocked PET detector
• Lines 133–134: Here we calculate the random distances, path1 and path2, to be travelled in the
detectors before generating a decay point.
• Lines 135–136: Here we call track_ray twice to see if the gammas will be detected or not. The
return values ex1 and ex2 are set to zero if their gamma is not detected.
• Lines 137–155: These lines are executed when both gammas are detected; they build a LOR to add
to the system matrix.
○ Line 138: The point p1 is set to the detection point for g1 using g1.lam2.
○ Line 139: The polar angle phi in the transverse plane is calculated here from the x and y
coordinates of p.
○ Lines 140–141: The LOR coordinates c1 and z1 are found here.
○ Lines 142–145: The LOR coordinates c2 and z2 for the g2 gamma are found here in the
same way.
○ Line 146: If the two points in the candidate LOR have a z difference less than the number of
detector rings, lines 147–151 are executed. If necessary the two ends of the LOR are swapped and
then the appropriate element of map is incremented using atomicAdd.
• Lines 157–158: The number of good LORs and final random number state for the current thread are
stored in the global memory at the end of the generation loop.
The extra steps required to accurately simulate a blocked detector do not have much
impact on the rate at which events are generated by the code; the performance of the three
versions is shown in Table 8.2.
Note the efficiencies in Table 8.2 reflect the real-world efficiency of the PET geometry
and detectors for capturing decay events, not of the software. In fact, these efficiencies are
in themselves useful results and our fast simulations could be a useful aid in optimising
hardware design.
This essentially completes our discussion of the code for this section on PET; however,
there is quite a lot of additional work needed on the host code to support the use of blocked
detectors. Careful inspection of Figure 8.10 reveals that introducing block detectors breaks
the 400-fold geometrical symmetry of the system matrix. A 100-fold symmetry remains;
crystal detectors having the same number in this figure have the same values for their system
matrix elements. Here we are using rotational symmetry and reflection symmetry with
respect to the centres of the blocks. This means the system matrix will increase by a factor
of 4 in size and the reconstruction kernels become more complicated.
We have also neglected gamma scattering events where the angle of the ray might change
as it passes through material. For PET, the scatter of gammas in the detectors is not much of a
problem. Scatter and absorption within the subject before the gamma pair reach the detectors
is an important issue, but that is another story beyond the scope of the present chapter.
However, here again simulation can play a big role.
where $K_{ij}$ is the blurring filter which gives the contribution to $q_i$ from pixel $p_j$. It is
understood that both i and j represent 2D ranges of x and y coordinates and the notation
$\{j\}$ in the summation indicates that j covers a 2D neighbourhood of pixels centred on i. The
RL iteration method is then given by:

$$ p_j^{n+1} = p_j^{n} \sum_{\{i\}} \frac{q_i}{c_i}\, K_{ij}, \tag{8.13} $$
Eqs. 8.7, 8.8 and 8.9 are clearly the same MLEM algorithm that we used for PET. The
summation in Eq. 8.8 is the backward projection from the measured image pixels $q_i$ to the
true image but with $q_i$ divided by $c_i$; as $c_i \to q_i$ the ratio tends to one and the iteration
has converged.
Examples 8.14 and 8.15 show our implementation.
15 int iy = y+offset-ky;
16 iy = clamp(iy,0,ny-1); // clamp to edge values
17 for(int kx=0;kx<nkern;kx++){
18 int ix = x+offset-kx;
19 ix = clamp(ix,0,nx-1); // clamp to edge values
20 sum += p[nx*iy+ix]*kern[ky*nkern+kx];
21 }
22 }
23 q = sum; // qi is sum Kij*pj over neighbourhood of i
24 }
30 __global__ void rl_forward(float *kern, int nkern,
float* p1, float *c,float *q, int nx, int ny)
31 {
32 int x = blockDim.x*blockIdx.x +threadIdx.x;
33 int y = blockDim.y*blockIdx.y +threadIdx.y;
34 if(x >= nx || y >= ny) return;
35 float f= 0.0f; // single element of c
36 convolve(x,y,kern,nkern,p1,f,nx,ny);
37 c[y*nx+x] = (abs(f) > 1.0e-06) ? q[y*nx+x]/f : q[y*nx+x];
38 }
• Lines 17–19: These organise the x loop similarly to the above for y.
• Line 20: The only line that does any real work – here we add the contributions from elements of p in
the neighbourhood of the target weighted by the appropriate kernel values.
• Line 23: Here we store the final result in q. Notice q is a scalar in this function; this is because
convolve is called separately by each thread in the kernel; in corresponding host code q would be
an array.
• Line 30: This is the declaration of the rl_forward kernel which handles the forward projection of
Eq. 8.9. The arguments are the convolution kernel (kern & nkern), the current estimated activity
p1, the output c array, the measured image q and finally the image dimensions.
• Lines 32–34: Find the point (x,y) about which the current thread performs convolution (i.e. the
index i in Eq. 8.9).
• Lines 35–36 Perform the convolution for Eq. 8.9 with the result being placed in the temporary
variable f.
• Line 37 Stores the result as q_i/c_i in the output array c. It is this ratio that is required for the next step
so this is more efficient than simply storing c.
• Line 40: This is the declaration of the rl_backward kernel which handles the backward projec-
tion of Eq. 8.8. The kernel arguments are the convolution kernel, (kern & nkern), the current
estimated activity p1, the improved estimate p2, the c array as computed by rl_forward and the
image dimensions.
• Lines 42–44: Find the image point (x,y) for this thread.
• Lines 45–46: Perform the convolution step corresponding to Eq. 8.8.
• Line 47: Saves the result.
Example 8.15 rl_deconv host function
Initially p1 holds the estimated image and afterwards p2 holds the improved estimate. It is
important to note that we must use two separate kernels here. This is because the calculation by a
single thread in rl_backward may depend on values of c calculated by multiple thread blocks
in rl_forward. Thus, the entire calculation of c by rl_forward must be complete before
rl_backward can safely commence; the overall iteration loop is sketched after these notes.
○ Lines 63–64: These are the same as 61–63 except that the roles of p1 and p2 are interchanged.
Thus at the end of both pairs of calls, p1 contains the result of two iterations.
○ Line 66: Copies the final result from GPU to host.
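A sketch of the host-side iteration loop implied by these notes is shown below. The launch configuration and the rl_backward argument order are assumptions based on the descriptions above (rl_forward follows the declaration in Example 8.14), and iter is assumed to be even; the book's actual rl_deconv host function is in the repository.

__global__ void rl_forward (float *kern, int nkern, float *p1, float *c,
                            float *q, int nx, int ny);              // as in Example 8.14
__global__ void rl_backward(float *kern, int nkern, float *p1, float *p2,
                            float *c, int nx, int ny);              // argument order assumed

void rl_iterate(float *kern, int nkern, float *p1, float *p2,
                float *c, float *q, int nx, int ny, int iter)
{
    dim3 threads(16, 16);                            // one thread per image pixel
    dim3 blocks((nx+15)/16, (ny+15)/16);
    for (int k = 0; k < iter/2; k++) {               // two RL iterations per pass
        rl_forward <<<blocks, threads>>>(kern, nkern, p1, c, q, nx, ny);
        rl_backward<<<blocks, threads>>>(kern, nkern, p1, p2, c, nx, ny);
        rl_forward <<<blocks, threads>>>(kern, nkern, p2, c, q, nx, ny);
        rl_backward<<<blocks, threads>>>(kern, nkern, p2, p1, c, nx, ny);
    }
    cudaDeviceSynchronize();                         // p1 now holds the final estimate
}

Because all the kernels run in the same default stream, each rl_forward call completes before the following rl_backward starts, which provides the synchronisation discussed above.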
Figure 8.12 shows results from running this example on some blurred text. The computa-
tion time taken to run the program on the GPU corresponds to about 360 × 10^9 fused
multiply and add instructions per second. Since there is also a relatively high demand for
memory, using vector-loading techniques might give a modest further speed-up. However, it
is very doubtful that using shared memory for the kernel would help due to the large halos
required for larger kernels. Compared to a single CPU on the host PC the GPU version is
about 450 times faster. It is also interesting to note that image quality continues to
improve for up to at least 10^6 iterations. This might be helpful for difficult images.
We slightly cheated with this example in that we were able to use exactly the same kernel
for deconvolution of the blurring as was used to create the blurred image. In a real-world case, the
original kernel might not be known, but this is exactly where the GPU speed-up helps. We
can quickly try lots of kernels to find which gives the best result. We could envisage
automatic searches for kernels that maximise image derivatives, for example. So-called
“blind-deconvolution” is an ongoing research topic and clearly GPUs have an important
role to play.
The original blurring operation in Eq. 8.7 is of course just the convolution of the image p
with the kernel K. There are other faster deconvolution methods available, for example,
taking ratios of the image and kernel in Fourier space. However, these methods are less
robust against noise than the RL method. More interestingly, standard convolution-based
methods assume that the filter K is position independent. MLEM-based methods do not have
this restriction and will work if K depends on the position of the image pixels p. The system
matrix used in our PET examples depends strongly on voxel positions. It would be easy to
extend our RL example to allow threads to calculate the elements of K when necessary. This
could be used to remove rotation smearing from long exposure images, for example, in
astrophotography where the arc lengths of star trails depend on their position on the image.
In Figure 8.12 the images show some well-known text. These images have a size of
600 × 320 pixels. The top left image shows the original text after blurring with a truncated
Gaussian filter having a standard deviation of 10 pixels embedded in a 15 × 15 pixel kernel.
The remaining five images show the results after 10^2, 10^3, 10^4, 10^5 and 10^6 iterations of our
GPU kernel. The times required were 204 ms, 434 ms, 2.4 sec, 22.1 secs and 230 secs
respectively. An RTX 2070 GPU was used for this example. The text is just readable after
only 100 iterations and excellent after 10^6.
This concludes Chapter 8; in the next chapter we look at techniques for getting even more
performance by using multiple GPUs.
Endnotes Chapter 8
1 Kaufman, L. Implementing and accelerating the EM algorithm for positron emission tomography. IEEE
Transactions on Medical Imaging 6, 37–51 (1987).
2 MLEM is essentially a linear method which assumes that the detectors record a linear response to the
received signals.
3 This is actually a commonly applied restriction. LORs with a small c1-c2 difference enter the detectors
at shallow angles and may penetrate several crystals before interacting. This leads to larger depth of
interaction blurring effects from these LORs.
4 Equipping readers with the necessary insights to efficiently map the particular details of their own
research problems into GPU code is my key motivation for writing this book.
5 Hudson, H.M. and Larkin, R.S. Accelerated image reconstruction using ordered subsets of projection
data. IEEE Transactions on Medical Imaging, 13(4), pp.601–609 (1994).
6 For clarity in this section we will use the defined values of the scanner parameters, e.g. 400, not the
parameter name e.g. cryNum. However, in the code the parameter names are always used.
7 This is a fundamental design choice in many simulations. Here we need a very large array to
accumulate the totals for every possible LOR, in this case a histogram with ~6.5 × 10^8 bins;
however, this is small compared to the space that would be required for, say, 10^12 generated LORs. In
other simulations where the number of generated events is small compared to the number of event
configurations it would be efficient to save individual events rather than a histogram.
8 The acronym ROI is commonly used in medical imaging circles and means region of interest.
9 Ansorge, R. “List mode 3D PET reconstruction using an exact system matrix and polar voxels.” In 2007
IEEE Nuclear Science Symposium Conference Record, vol. 5, pp. 3454–3457. IEEE, 2007. Ansorge, R.
et al. “Very high resolution 3D list-mode PET reconstruction using polar voxels.” In 2008 IEEE Nuclear
Science Symposium Conference Record, pp. 4112–4114. (2008).
10 The overall normalisation of the values in the system matrix turns out to cancel in the expression for MLEM
iterations, so in the code we normalise the probabilities to 10^6 instead of 1 to avoid tiny numbers in
S. The important point is that the same number of generation attempts is made for each spot file used.
11 Converting from unsigned int to float for int values > 2^24 will introduce rounding errors, but these are
completely negligible compared to the Poisson statistics on the actual values found in the simulation and
the sort of accuracy needed in real PET scans. We choose to use float because 32-bit floating point
calculations are faster on most NVIDIA GPUs. The most recent Turing cards have better support for
integer arithmetic.
12 A great deal of older PET literature discusses sinogram data rather than list-mode data. A sinogram is
simply a version of list-mode data where groups of LORs are summed together to form a Radon transform
of the actual activity. Sinograms both reduce the size of the measured activity data set and enable fast
analytical approaches such as filtered back projection (FBP) for the reconstruction of the activity.
Unfortunately, these approaches are approximate, particularly for 3D data sets which use all or most
of the detected LORs not just those in the plane of a single detector ring. As modern computing
hardware becomes increasingly powerful, computationally expensive methods based on the fully 3D
MLEM method are increasingly used.
13 While developing this code we observed that using the generic min with floats in kernel code occasionally
appeared to use an integer overload and hence round floats down, giving hard to find bugs. To avoid
ambiguity, we switched to using explicit versions of these generic functions, fminf in this case.
14 We had some fun figuring out how to define bRadius as a constexpr since the expression involves
a tangent but constexpr objects are evaluated at compile time and cannot use the standard maths
library. We wrote our constexpr trig functions using power series to do the job, these can be found in
cxconfun.h.
15 Actually, we have assumed that there are no gaps between detectors along the z-axis. For most scanners
this is not true, so we might also have to search for neighbouring blocks in the z direction as well as phi.
16 See for example the online article in Wikipedia.
17 Dealing with edges always needs consideration when applying image filters; by clamping to the edges
here we are tacitly assuming the illumination beyond the borders of the image is the same as at the
edges. This is a reasonable starting guess.
9
Scaling Up
Sometimes one GPU is not enough and CUDA contains tools that allow you to spread a big
calculation over more than one GPU. At this point you are moving into the world of high-
performance computing (HPC) and are probably more interested in using existing applica-
tions rather than developing your own code from scratch. However, you may need to write
additional code to tailor a standard application to your specific problem. In addition, we
think that it is always helpful to have some understanding of what is going on to enable
multiple GPUs to share a big task. There are two stages in the scaling up:
Stage 1 A single workstation with multiple GPUs
This can simply involve plugging a second GPU into an existing PC or buying a pre-
configured high-end workstation with four or eight GPUs. Key considerations are how the host
code manages workflows on these GPUs and how the GPUs can access each other’s data.
If you are moving up to large scale GPU computing, then it is probably time to use Linux
rather than Windows. This is because currently NVIDIA does not support some of the best
features for managing multiple GPUs in Windows. Most dedicated workstations with
multiple GPUs can be supplied running a version of the Linux operating system.
Figure 9.1 (a–c) shows 3 levels of scaling up from a basic system with just one GPU.
Figure 9.1 (a) shows the first step up – a single PC with two GPUs on the PCIe bus.
Figure 9.1 (b) shows an advanced workstation with a 20 core CPU and four high-end GPUs
connected with two PCIe buses and additional multiway connections between pairs of GPUs
using NVIDIA’s proprietary NVLINK. The final scaling step is shown in Figure 9.1 (c)
where multiple workstations of the types shown in Figures 9.1 (a) or (b) are linked together
by a switched network that can route data between any pair of workstations.
In advanced HPC systems the network interconnect is likely to be InfiniBand rather than
ethernet. In this context a single workstation is often referred to as a node. The topology of
the network interconnect can also become complicated with communication between “close”
nodes having less latency than communication between “distant” nodes. The network shown
in Figure 9.1 (c) has two levels of latency; each group of 18 nodes is connected to a single
InfiniBand L2 switch that allows those nodes to communicate with each other directly.
Communication between different 18-node groups is possible using the second layer of
L1 switches. The L1 layer has slightly higher latency and less aggregate bandwidth. The
L2 connections to the nodes are 40 Gbits/sec and the connections between the L1
and L2 switches are 2 × 40 Gbits/sec. Figure 9.1 (c) is taken from the Mellanox
website: (https://community.mellanox.com/s/article/designing-an-hpc-cluster-with-mellanox-
infiniband-solutions). This company, which is now owned by NVIDIA, manufactures
InfiniBand equipment and interested readers can find out much more from their website.
At the time of writing three of the five most powerful supercomputers in the November
2020 TOP500 list: (www.top500.org/lists/top500/2020/11/) included Volta or Ampere
GPUs. Number 2 on the list is the Summit machine at Oak Ridge: (www.olcf.ornl.gov/
olcf-resources/compute-systems) which has 4608 nodes. Each node on Summit has two IBM
Power9 CPUs and six NVIDIA V100 GPUs. Interestingly, the latest NVIDIA GPUs have
hardware features to enhance IO between these specific IBM CPUs and the GPUs.
Versions of the Linux operating system dominate in the world of HPC and some recent
NVIDIA CUDA developments such as demand paged virtual unified memory management
are currently only supported on Linux.
This chapter discusses the additional programming tools and techniques that are used to
scale up a calculation from a single GPU to these more powerful systems.
The simplest programming paradigm for a single workstation, such as that shown in
Figure 9.1 (a) or (b), is for the host to control all data flow to and from the GPUs; in such
cases, essentially, the only code modification required is for appropriate calls to the
cudaSetDevice function to be used in host code to specify which GPU subsequent
CUDA calls that run kernels or access GPU memory refer to. For larger scale systems such
as shown in Figure 9.1 (b) this approach becomes cumbersome and NVIDIA provides tools
to manage both peer-to-peer communication between GPUs over either PCIe or NVLINK
and also various unified memory addressing modes that allow the programmer to treat CPU
and GPU memories as a single entity without needing explicit cudaMemcpy calls. This is
discussed below, in our section on Unified Memory. Finally, the management of communi-
cation between HPC systems with large numbers of interconnected nodes as shown in
Figure 9.1 (c) is best managed using MPI; a section on MPI is also included in this chapter.
Although MPI is not strictly a CUDA topic, it shares many of the same ideas and indeed,
because MPI emerged in the early 1990s, well before CUDA, it almost certainly influenced
the design of CUDA. MPI is an acronym for message passing interface.
18 }
25 int ngpu = 0;
26 cudaGetDeviceCount(&ngpu);
27 printf("Number of GPUs on this PC is %d\n",ngpu);
work; the steps in the remainder of the loop body will apply to this GPU.
○ Line 33: Create a host buffer using good old-fashioned malloc rather than a container class. This
will allow a more natural transition to CUDA managed memory in later examples. The pointer is
held in the temporary variable float *a and is then stored in the host_buf vector using
push_back.
○ Lines 34–35: Create and store pointers to the device a and b data buffers. Note that, like many
CUDA functions, the syntax of cudaMalloc means that a temporary variable must be used for
the result; here we reuse a for this purpose.
○ Line 36: Fill the host buffer host_buf[gpu] with some random data. In a real-world example
host_buf[gpu] would receive different data for different values of the index gpu.
○ Line 37: Copies the contents of host_buf to the device buffer dev_a.
○ Line 38: Launches a kernel using the newly created buffers. The kernel will be run on the
currently active GPU, in this case gpu. This call is not blocking on the host and we proceed to
the next pass through the loop using gpu+1 while work is still running on gpu.
• Line 40: Additional host work running in parallel with the kernels executing on the GPUs could be
placed here.
• Lines 41–44: Loop over GPUs to wait for the pending kernel to complete and then copy the device
results buffer, dev_b, back to the host. Note the call to cudaSetDevice(gpu) at the start of
each pass through the loop.
• Line 45: Additional host work to process the results from the kernels could be done here.
• Lines 46–53: Tidy up at the end of the main program. We use yet another loop over the GPUs to free
the allocated host and device memory blocks and reset each device. The overall loop pattern is sketched below.
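The per-GPU loop pattern described in these notes is sketched here. The kernel gpu_work is a placeholder and several details (data values, launch configuration, error checking) are simplified; the full Example 9.1 listing is in the repository.

#include <vector>

__global__ void gpu_work(float *a, float *b, int n);     // placeholder kernel

void run_on_all_gpus(int ngpu, int dsize, int blocks, int threads)
{
    std::vector<float*> host_buf, dev_a, dev_b;
    for (int gpu = 0; gpu < ngpu; gpu++) {
        cudaSetDevice(gpu);               // subsequent CUDA calls target this GPU
        float *a = (float *)malloc(dsize*sizeof(float));
        host_buf.push_back(a);
        cudaMalloc(&a, dsize*sizeof(float)); dev_a.push_back(a);
        cudaMalloc(&a, dsize*sizeof(float)); dev_b.push_back(a);
        for (int k = 0; k < dsize; k++) host_buf[gpu][k] = (float)k;  // stand-in data
        cudaMemcpy(dev_a[gpu], host_buf[gpu], dsize*sizeof(float),
                   cudaMemcpyHostToDevice);
        gpu_work<<<blocks, threads>>>(dev_a[gpu], dev_b[gpu], dsize); // asynchronous
    }
    for (int gpu = 0; gpu < ngpu; gpu++) {            // wait and collect results
        cudaSetDevice(gpu);
        cudaDeviceSynchronize();
        cudaMemcpy(host_buf[gpu], dev_b[gpu], dsize*sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
}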
The code in Example 9.1 will work but it is rather ugly; in more complicated situations it
would be easy to overlook a necessary cudaSetDevice causing hard to find bugs. In
addition, for many problems, for example iterative grid PDE solvers, data needs to be
exchanged between GPUs as the calculation progresses. While this can always be done by
copying data back and forth between the host and GPUs, a better solution would be a direct
exchange of data between the GPUs bypassing the host. Such direct inter-GPU or peer-to-
peer communication is indeed possible over the PCIe bus or NVLINK (if present). This will
be explained in the next section as part of a more general discussion of CUDA support for
the management of the multiple memory subsystems on a typical GPU-based workstation.
Table 9.2 Values of the CUDA cudaMemcpyKind flag used with cudaMemcpy functions
source and destination memory pointers refer to the host or GPU memory. The allowed
values of this flag are shown in Table 9.2.
Using cudaMemcpyDefault whenever possible is recommended as this makes your code
more portable and removes a potential source of bugs: specifying this flag incorrectly leads
to undefined behaviour. The other important change is that using cudaMemcpyDefault
allows cudaMemcpy to copy data directly between two different GPUs on the same system.
The data flow in such transfers will be directly over NVLINK (if available) or the PCIe bus.
Such transfers are usually referred to as peer-to-peer or P2P transfers and do require a little
extra preparation in CUDA programs as discussed next.
Example 9.2 p2ptest kernel demonstrating P2P operations between two GPUs
12 dst[idx] = 2.0f*src[idx];
13 }
. . .
20 // check for p2p access
21 int p2p_1to2;
cudaDeviceCanAccessPeer(&p2p_1to2, gpu1, gpu2);
22 int p2p_2to1;
cudaDeviceCanAccessPeer(&p2p_2to1, gpu2, gpu1);
23 if(p2p_1to2 == 0 || p2p_2to1 == 0) return 1;
24 cudaSetDevice(gpu1);
cudaDeviceEnablePeerAccess(gpu2, 0);
25 cudaSetDevice(gpu2);
cudaDeviceEnablePeerAccess(gpu1, 0);
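Once peer access has been enabled as in lines 24–25, a direct device-to-device copy can be issued from the host. The buffer names below are illustrative; the pointers are assumed to have been allocated on gpu1 and gpu2 respectively.

// either rely on UVA and the default flag ...
cudaMemcpy(dev_buf2, dev_buf1, size*sizeof(float), cudaMemcpyDefault);
// ... or name the source and destination devices explicitly
cudaMemcpyPeer(dev_buf2, gpu2, dev_buf1, gpu1, size*sizeof(float));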
Note these flags can be combined with each other using the bitwise OR operator “|”.
It is also possible to go further and allow the Host and all attached GPUs to use the same
region of host memory directly without the need for explicit transfers using CUDA zero-
copy memory.
writes to the memory at the expense of greatly slowing down host reads from the memory.
Note write combining is an Intel PC feature, not a CUDA feature; it is likely to be useful in
cases where the host often writes to the pooled memory but rarely if ever reads from it.
Notice that when using zero-copy memory care must be taken to properly synchronise
writing and reading of that memory by the host and devices. There will not be any
cudaMemcpy calls to provide implicit synchronisation. Thus, extra care needs to be taken
with appropriate use of cudaDeviceSynchronize and CUDA events to ensure that no
read-after-write errors occur.
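A zero-copy allocation might be set up as sketched below; the kernel my_kernel is a placeholder, and on older systems the cudaDeviceMapHost flag may need to be set before any allocations are made.

__global__ void my_kernel(float *data, int n);      // placeholder kernel

void zero_copy_demo(int dsize, int blocks, int threads)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);          // enable mapped host memory
    float *host_buf = nullptr;
    cudaHostAlloc((void **)&host_buf, dsize*sizeof(float), cudaHostAllocMapped);
    float *dev_view = nullptr;                      // device-side alias of host_buf
    cudaHostGetDevicePointer((void **)&dev_view, host_buf, 0);
    for (int k = 0; k < dsize; k++) host_buf[k] = 1.0f;   // host writes input data
    my_kernel<<<blocks, threads>>>(dev_view, dsize);      // GPU reads/writes directly
    cudaDeviceSynchronize();                        // required before the host reads results
}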
More details on managed memory can be found in an appendix of the NVIDIA CUDA Cþþ
Programming Guide.
The performance of the various memory allocation methods is explored with the set of
Examples 9.3–9.10. In these examples we use our best reduce kernel from Chapter 3 with
these different methods.
Example 9.3 shows the reduce_warp_vl kernel and the simple main routine that
performs all the tests.
Example 9.3 Managed memory timing tests reduce_warp_vl kernel and main routine
12 v += w.shfl_down(v,16);
13 v += w.shfl_down(v,8);
14 v += w.shfl_down(v,4);
15 v += w.shfl_down(v,2);
16 v += w.shfl_down(v,1);
17 if(w.thread_rank() == 0)
atomicAdd(&sums[b.group_index().x],v);
18 }
38 if(test==0)
reduce_classic(blocks,threads,dsize,t2);
39 else if(test==1)
reduce_classic_pinned(blocks,threads,dsize,t2);
40 else if(test==2)
reduce_thrust_standard(blocks,threads,dsize,t2);
41 else if(test==3)
reduce_thrust_pinned(blocks,threads,dsize,t2);
42 else if(test==4)
reduce_thrust_hybrid(blocks,threads,dsize,t2);
43 else if(test==5)
reduce_zerocopy(blocks,threads,dsize,t2);
44 else if(test==6)
reduce_managed(blocks,threads,dsize,t2);
45 double t1 = tim.lap_ms();
46
printf("test %d total time %.3f kernel time %.3f
ms\n",test,t1,t2);
47 std::atexit([]{cudaDeviceReset();});
48 return 0;
49 }
Example 9.4 shows our first test which uses standard cudaMalloc for device memory
allocation. This sets our performance baseline.
• Line 57: Here we call fill_buf to fill the host array host_buf with data. The sum of these
values is returned by the function and stored in check.
• Line 58: Starts a timed block of code ending at line 64.
• Line 59: Copies the host data to the device using an explicit cudaMemcpy call.
• Line 60–61: Call the reduce_warp_vl kernel twice, first to get thread block sums into the device
array dev_sum and then a second time to sum the elements of dev_sum to the first (and only)
element of the array dev_tot. Note there is no need for a host array corresponding to dev_sum.
• Line 62: Copy the final total from dev_tot to host_tot using cudaMemcpy to copy one word
back to the host.
• Line 63: Call cudaDeviceSynchronize() prior to the timing call in the next line. This call is
not strictly necessary because the previous cudaMemcpy call is blocking on the host.
• Line 64: Find the time taken to run the kernels and perform the memory transfers and store the result
in t.
• Line 65: Check the GPU reduce calculation agrees with the direct calculation on the host.
• Lines 66–68: Explicitly free all allocated memory buffers and return to caller.
As discussed previously the performance of memory-bound kernels, such as this
reduce example, can be improved by a factor of two by using pinned memory on the host.
The pinned memory version is shown in Example 9.5. We note the kernel execution time is
reduced from about 12.7 ms down to about 5.9 ms. This is due to the speed-up of
cudaMemcpy transfer in lines 79 and 82.
85 if(check != host_tot)
printf("error classic pinned: sum %u check %.0f\n",
host_tot,check);
86 cudaFreeHost(host_buf);
87 cudaFree(dev_buf); cudaFree(dev_sum); cudaFree(dev_tot);
88 return 0;
89 }
• Line 72: Here we use cudaMallocHost instead of malloc for the allocation of the large host
buffer host_buf. This creates a pinned memory allocation.
• Line 86: Here we use cudaFreeHost to free the memory allocated to host_buf rather than
using free as was done in the previous version.
Using pinned memory gives a factor of 2 speed-up for this kernel which is limited by the speed of
memory access. These are the same results we found in Chapter 3.
The next Example 9.6 uses thrust vectors as containers for the memory allocations.
This is our standard practice for most examples so it is shown here for the sake of
completeness. The vector host_buf is again allocated using pinned host memory.
Example 9.6 Managed memory test 3 using thrust for memory allocation
103 cudaDeviceSynchronize();
104 t = cuda.lap_ms(); // timed block end
105 if(check != host_tot[0])
printf("error pinned thrust: sum %u check %.0f\n",
host_tot[0],check);
106 return 0;
107 }
The next Example 9.7 is new and illustrates the use of zero-copy memory where the
device directly accesses a block of host pinned memory, removing the need for any explicit data
transfer between the host and device.
125 cudaFreeHost(host_buf);
126 cudaFreeHost(host_sum);
127 cudaFreeHost(host_tot);
128 return 0;
129 }
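Since only the tail of Example 9.7 is reproduced above, here is a hedged sketch of the zero-copy allocation pattern it relies on; the kernel and buffer names are illustrative and not those of the book.

// Sketch of zero-copy (mapped) host memory: the kernel reads and writes the
// host buffer directly over PCIe, so no cudaMemcpy is needed.
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float s)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    for (int k = tid; k < n; k += gridDim.x*blockDim.x) data[k] *= s;
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // needed on some older systems
    int n = 1 << 20;
    float *host_buf = nullptr;               // pinned and mapped host allocation
    cudaHostAlloc((void **)&host_buf, n*sizeof(float), cudaHostAllocMapped);
    for (int k = 0; k < n; k++) host_buf[k] = 1.0f;

    float *dev_view = nullptr;               // device-side alias of the same memory
    cudaHostGetDevicePointer((void **)&dev_view, host_buf, 0);

    scale<<<256,256>>>(dev_view, n, 2.0f);   // accesses travel over the PCIe bus
    cudaDeviceSynchronize();                 // results are now visible in host_buf

    cudaFreeHost(host_buf);
    return 0;
}

On systems with unified virtual addressing the host pointer itself can usually be passed to the kernel, but requesting the device pointer explicitly, as here, is always safe.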
Finally, in Example 9.8 we illustrate the use of managed memory, where the same pointer
is used on both the host and the GPU.
• Lines 142–144: Allocate the same three blocks of memory as before, but this time we use
cudaMallocManaged which means that the same pointer is used for both the host and device
code. Thus, in this version we have left off the prefixes host_ and dev_ used for pointers in the
previous versions. On Windows and CC < 6.0 devices, full-size blocks of pinned host memory and
GPU memory will be allocated. On Linux and CC ≥ 6.0, demand-paged virtual memory is used and
these blocks will be mapped to physical host or GPU memory by the driver when read or written.
• Line 145: Fills buf on the host. Note that the driver assumes that managed memory allocations are
initially valid on the GPU. Thus on Windows and CC < 6.0 devices the entire memory block buf
is copied from the GPU to the host before being written to by the host. This accounts for most of the
performance drop observed in our test.
• Line 147: This is the first kernel call and the contents of buf will be copied back to the GPU before
the kernel starts. There is no copying associated with the array sum because the driver assumes that
this array is valid on the GPU on first use.
• Line 148: This second kernel call does not cause any implicit memory transfers.
• Line 151: This use of tot[0] by the host will trigger the array tot to be copied from the GPU to
the host.
• Lines 152–154: Use cudaFree to release the managed memory allocations. A minimal sketch of this managed memory pattern is shown below.
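The sketch below uses illustrative names rather than those of Example 9.8; the point is that one pointer serves both host and device code and no explicit copies appear.

// Sketch of managed (unified) memory: one pointer is valid on host and device,
// so there are no explicit cudaMemcpy calls and no host_/dev_ prefixes.
#include <cuda_runtime.h>

__global__ void add_one(int *buf, int n)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < n) buf[tid] += 1;
}

int main()
{
    int n = 1 << 20;
    int *buf = nullptr;
    cudaMallocManaged((void **)&buf, n*sizeof(int));

    for (int k = 0; k < n; k++) buf[k] = k;  // host write; on Windows / CC < 6.0
                                             // this may first migrate buf to the host
    add_one<<<(n+255)/256,256>>>(buf, n);    // driver migrates buf back to the GPU
    cudaDeviceSynchronize();                 // required before the host reads buf

    int first = buf[0];                      // host read triggers migration again
    (void)first;
    cudaFree(buf);                           // managed memory is released with cudaFree
    return 0;
}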
The full version of the code for these tests includes additional tests 2, 4 and 7 which are
not shown here. Tests 2 and 4 are variations on the thrust test 3 which perform similarly to
either test 1 or test 3. Test 7 uses advanced memory management currently only available
on Linux.
The timing tests included in the above examples demonstrate that:
1. Using pinned host memory gives a performance boost of roughly a factor of 2 as
compared to using normal memory.
2. Zero-copy memory works just as well as explicitly copying data between the host and GPU
memory for the reduction example.
3. Managed memory is significantly slower on a Windows platform.
The reason that we did not see a significant performance hit when using zero-copy memory
(test 5) in Example 9.7 is that we used a single data copy from the host to the GPU across the
PCIe bus. When using zero-copy memory this is done implicitly when the first kernel reads
the input data buffer. We have designed this kernel to read the input data very efficiently, so
this operation is essentially about the same speed as cudaMemcpy. If, however, the GPU
reads the input data several times during processing, then the zero-copy version of our
example would perform worse because multiple reads across the PCIe bus would be
necessary. In contrast, in the cudaMemcpy version of the example (test 1) only a single
read across the PCIe bus is necessary. This is because afterwards the input data is stored in
GPU memory.
We can demonstrate this by adding two additional kernel calls to our example, as shown in
Example 9.3. The added kernels first replace the content of the device data buffer by the
square root of their original values rounded back to an integer and then the second new
kernel squares this result. Thus, both new kernels read and write to the whole device data
buffer. The execution times of the timed block and kernels in Examples 9.2 and 9.3 are
summarised in Table 9.3.
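The two extra kernels are not reproduced in this extract; a hedged sketch consistent with the description above, assuming the data buffer holds unsigned integers as in the earlier reduce examples, might look like this. Both kernels would be launched with the same blocks and threads configuration as the reduce kernels.

// Hedged sketch of the two added kernels: the first replaces each element by
// the square root of its value rounded to an integer, the second squares it
// again, so together they force extra reads and writes of the whole buffer.
__global__ void to_sqrt(unsigned int *data, int n)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    for (int k = tid; k < n; k += gridDim.x*blockDim.x)
        data[k] = (unsigned int)(sqrtf((float)data[k]) + 0.5f);  // rounded sqrt
}

__global__ void to_square(unsigned int *data, int n)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    for (int k = tid; k < n; k += gridDim.x*blockDim.x)
        data[k] = data[k]*data[k];                               // undo the sqrt
}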
As can be seen from Table 9.3 adding the two extra kernels to Example 9.2 makes little
difference to the kernel execution times except for the case of zero-copy memory where total
kernel time has increased from about 6.5 ms to 19.4 ms, about a factor of 3. The increase is
entirely due to the fact that in 9.3 the kernels have to read the 16 MB memory buffer three
times and write it twice instead of reading it just once as in Example 9.2. The additional
kernel computation takes only about 0.8 ms.
Table 9.3 shows the times measured by the host code timers; the main features are a 3-fold
increase in kernel time between the 2-kernel and 4-kernel cases for the zero-copy memory
(test 5) and a 4-fold increase in time for the managed memory case (test 6).
We can get more precise information from nvprof, as shown in Table 9.4. The values
shown in this table are the averages of 5 separate runs.
In Table 9.5 we see more accurate estimates of the true individual kernel execution
times. In all cases except for test 5 the time required to run all four kernels is only about
0.91 ms. This should be compared to the values of about 6 ms obtained from the host-based
timers. (NB for Test 6 the automatic H2D memory transfer was performed as 512 separate transfers
of 128 KB blocks and the unnecessary initial D2H transfer was performed using 70 blocks of up to
1 MB in size.) This difference is due to various overheads such as kernel launch and device
synchronisation that occur when CUDA kernels are run. Note that some of these overheads
increase significantly for tests 3–6. The device synchronisation overheads for tests 3 and 5
have a significant impact on the total job time for the short duration kernels being tested in
Examples 9.2 and 9.3 but are still only a few ms and thus would be less important for longer
duration kernels expected in real-world applications. In the case of test 6 however the
Memcpy overheads are significant and mean that, at least with the present Windows driver,
managed memory is only practical for applications where data is resident in GPU memory
for the duration of the calculation.
systems, were starting to exploit clusters of inexpensive PCs for the same purpose. MPI was
very influential in the success of so-called Beowulf PC clusters which started in 1998. These
clusters are arguably the blueprint for all present-day HPC computer systems which can have
a vast number of nodes. MPI is still the dominant software tool used for inter-node
communication in these systems. Table 9.6 shows the evolution of MPI over time.
A key feature of distributed systems is that each node has its own separate host memory. If
nodes need to see each other’s data, then that data must be copied across the network
connections. The bandwidth of the interconnect between nodes is often the limiting factor in
overall compute performance. The role of MPI is to optimally manage these transfers. From
the beginning MPI provided a comfortable and intuitive programming paradigm for coding
parallel programs and this explains its rapid uptake and continuing popularity. In fact, the
MPI paradigm is essentially the same as that used by present-day NVIDIA CUDA code; it is
reasonable to argue that MPI is the mother of CUDA. This is good news for readers of this
book – you essentially already know the ideas used in MPI.
• The first shared idea is that an MPI program is run by launching a number of cooperating
processes that run on the nodes of the distributed system. A common configuration would
be a number of processes equal to the total number of compute cores on the system. MPI
would then ensure that each compute core would run one process. Thus, if each node has a
20-core CPU it would be natural to use 20 × (number of nodes) as the required number of
processes when launching an MPI job. This is directly analogous to choosing the number
of thread blocks and thread block size when launching a CUDA program.
• The second shared idea is that you only write one program and that program runs on all
MPI processes just like a CUDA kernel is one piece of code that is run by all GPU threads.
• The third shared idea is that each MPI process can find its own rank in the MPI job. If there
are a total of np MPI processes in the job then a process’s rank is an integer in the range
0 to np‒1. This is analogous to finding a CUDA thread’s rank using the built-in
threadIdx and blockIdx constants or simply grid.rank() if using
cooperative groups.
• The fourth shared idea is that calls to some MPI functions that involve data
transfers are blocking in the sense that the calling MPI process will wait until the transfer is
complete. For example, the MPI_Send function, which sends data from the calling
process to one or more other processes, will wait until the transfer is complete
before it continues.1 This is analogous to cudaMemcpy being blocked on the host in
CUDA code. There is also a non-blocking MPI_Isend call which is analogous
to cudaMemcpyAsync and an MPI_Wait call which is analogous to
cudaDeviceSynchronize.
• Finally, most MPI functions return error codes to indicate success or failure; this again is
the same approach as used by most CUDA functions.
An important difference between MPI and CUDA is that MPI uses a distributed memory
model whereas CUDA uses a shared memory model. Originally, MPI had no direct support
for shared memory, thus each process had its own copy of all MPI managed memory even if
these processes were running on the same CPU. In MPI the only way to process shared
information was by using the interconnect to transfer data. In the 1990s that was a perfectly
reasonable model as all PCs would have had a single CPU with one processing core. More
recent versions of MPI will recognise cases where two or more processes are running on the same PC
and will use direct memory copies to implement the data exchange whenever possible.
There is now also some support for shared host memory access.
Since about 20132 some versions of MPI are “CUDA-aware” which means that MPI will
use NVLINK to transfer data directly between GPU memories on the same workstation. In
this case CUDA unified virtual addressing (UVA) and MPI are used together, making
programming very straightforward.
The full MPI library contains literally hundreds of functions but fortunately straightfor-
ward applications can be built using only a small subset of these functions. Some of these
core functions are shown in Table 9.7.
More detailed information on MPI is readily available online. The full specifications can
be obtained from www.mpi-forum.org/docs/ and tutorials are available from https://
mpitutorial.com/ and many other websites.
The first four functions in Table 9.7 will appear in every MPI program; the MPI
documentation recommends that MPI_Init is called at the very start of a program and is
necessary before any other MPI function is used. The arguments allow command line
parameters to be passed to MPI, but in practice they are not used. The next four commands,
Bcast, Scatter, Gather and Reduce are examples of MPI collective commands
which involve collaboration between all the calling processes. In the mid-1990s, MPI was
my introduction to parallel programming and I was immediately struck with the elegance
and simplicity of this approach – the same piece of code runs on all processors which share a
given task in a symmetrical way. One consequence of this paradigm is that speeding up a
calculation by adding more processors is in principle trivial, as no code needs to be
changed.3 Of course, as discussed elsewhere, not all algorithms deliver a speed-up propor-
tional to the number of processors. In particular, communication and other overheads limit
the speed-up that can be achieved by adding more processors for any particular parallel
computing task.
A simple complete MPI program to perform reduction (what else) is shown in Example 9.10.
It is important to keep in mind that MPI programs are like CUDA kernels; an instance of
the code will be run on each of the processors used.
For brevity we show examples of function calls in column 1 not the function prototypes. The functions
all return an integer error code which is zero (MPI_SUCCESS) if no error has occurred or positive
otherwise. We use standard names for the arguments as explained at the bottom of the table.
MPI_Init(&argc, &argv)
    Mandatory initialisation, should be the first statement in main.
MPI_Finalize()
    Close MPI on this node, usually the last statement.
MPI_Comm_size(MPI_COMM_WORLD, &procs)
    Sets int procs to the number of processes in this job.
MPI_Comm_rank(MPI_COMM_WORLD, &rank)
    Sets int rank to this process's rank in the job.
MPI_Bcast(sbuf, size, type, root, comm)
    Process root sends the data in its copy of sbuf to all other processes' copy of sbuf.
MPI_Scatter(sbuf, size1, type1, rbuf, size2, type2, root, comm)
    Process root sends subsets of the data in its copy of sbuf to all processes' copy of rbuf.
MPI_Gather(sbuf, size1, type1, rbuf, size2, type2, root, comm)
    Process root accumulates subsets of the data from each process's copy of sbuf to its copy of rbuf.
MPI_Reduce(sbuf, rbuf, size, type, op, root, comm)
    Each element of root's copy of rbuf is set to the sum (or other function depending on op)
    over all nodes of the corresponding element of sbuf. The resulting vector rbuf is thus the
    result of a vector operation op. Set op to MPI_SUM for summation.
MPI_Allreduce(sbuf, rbuf, size, type, op, comm)
    Same as MPI_Reduce except all processes get the result in their copy of rbuf. The argument
    root is not needed.
In the argument lists sbuf and rbuf are pointers to arrays containing the send and receive buffers for
operations. The data type of these buffers is indicated by the type arguments which are set to a defined
MPI keyword, for example, MPI_FLOAT. MPI defines keywords for all the usual data types and
additionally allows user defined types. If this keyword occurs twice in the argument list then sbuf and
rbuf can be different data types and MPI will perform type conversion during the operation. The
argument root is an integer indicating the sending node for send functions like Bcast and Scatter
or the receiving node for operations like Gather and Reduce. The argument op can be set to
MPI_SUM in the reduce function to indicate a classic addition of elements is required but other
operations are available. The final argument comm is an MPI communicator which specifies which of the
available processes will participate in this call. The most commonly used communicator is
MPI_COMM_WORLD which includes all processes. Other user-defined communicators for subsets of the
processes are possible; for example, in the case of multi-core nodes, communicators could be used to group
together MPI processes running on the cores of a single node.
03 #include <stdio.h>
04 #include <stdlib.h>
05 #include <string.h>
06 #include <mpi.h>
07 #include <vector>
23 int check = 0;
24 if(rank==root) { // fill with test data
25 for(int k=0;k<size;k++) sbuf[k] = k+1;
26 for(int k=0;k<size;k++) check += sbuf[k];
27 }
40 return 0;
41 }
• Lines 30–33: This is the place in the program where all processes do unique work in parallel. In this
example, we do very little work; the elements of rbuf are summed to int framesum. In a real-
world example, considerably more work might be done here.
• Line 36: Here we use the many-to-many MPI operation Allreduce. The first and second
arguments are vectors of equal length specified by the third argument and of a type specified by the
fourth argument. In our case the vectors are actually scalars of length one, so the sum
of the values of procsum across all processes is placed in fullsum on every
process. The fifth argument MPI_SUM specifies the type of reduction operation required. Other
options for this argument include MPI_MAX, MPI_MIN and MPI_PROD. Had the value of the third
argument been greater than one, the first two arguments would have been true vectors, say a and b;
then, for each element i, the sum of a[i] over all processes would be placed in b[i].
A many-to-one version of MPI_Allreduce also exists and is simply called MPI_Reduce. In
this version only a single process has its receive buffer updated with the summed values and the rank
of that process is specified by an additional argument placed just before the communicator.
• Line 37: Here we print some results; there is no if clause attached to this printf so each process will
print a line. A minimal sketch of the Allreduce pattern used here is shown below.
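Since only fragments of the listing appear in this extract, here is a minimal hedged sketch of that Allreduce pattern; it borrows the names procsum and fullsum from the text but is otherwise illustrative.

// Hedged sketch of the Allreduce reduction pattern; not the book's Example 9.10.
#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int procs = 0, rank = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int procsum = rank + 1;      // stand-in for the sum over this process's data

    // many-to-many reduction: every process receives the global total
    int fullsum = 0;
    MPI_Allreduce(&procsum, &fullsum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    // every rank prints its own line, as in the example's output
    printf("rank %d of %d procsum %d fullsum %d\n", rank, procs, procsum, fullsum);
    MPI_Finalize();
    return 0;
}

With np processes this prints fullsum = np(np+1)/2 on every rank.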
An example of compiling and running an MPI program is shown in Example 9.11. Here
we are using the Linux Bash shell running under Windows 10 with OpenMPI installed using
sudo. This provides the compiler mpic++ (a front end to g++) which compiles and links
MPI code and an executable mpirun to launch MPI jobs.
MPI_Scan(sbuf, rbuf, count, op, comm)
    For each process, the result vector rbuf is an element-wise inclusive prefix scan operation on the
    vectors sbuf ordered by process rank. The scan operation is indicated by op, e.g. MPI_SUM.
MPI_Alltoall(sbuf, size1, type1, rbuf, size2, type2, comm)
    For each process, sbuf is partitioned into subblocks of size size2 with stride size1 and these
    blocks are distributed between processes such that process p sends its subblock q to subblock
    position p in the rbuf of process q.
MPI_Send(buf, size, type, dest, tag, comm)
MPI_Recv(buf, size, type, source, tag, comm, &status)
    These two functions are used together to allow two processes to exchange data; one process uses
    MPI_Send to send data to a receiving process which must use a matching MPI_Recv call. The
    int variables source and dest are set to the ranks of the sending and receiving processes
    respectively. The values of int tag must be the same for both processes. (A minimal sketch of a
    matched pair is shown after this table.)
MPI_Barrier(comm)
    Waits for all processes in communicator comm to reach this point in the code. Similar to
    cudaDeviceSynchronize.
MPI_Comm_split(comm, color, key, &newcomm)
    Creates new communicators newcomm from subsets of processes in communicator comm where
    processes specifying the same value of color are in the same new communicator. The variable int
    key controls the rank order in the newcomm communicators; using the process rank for key is
    a common choice which maintains the original rank ordering.
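As promised in the table, here is a minimal hedged sketch of a matched MPI_Send/MPI_Recv pair; rank 0 sends a small array to rank 1, and all names are illustrative. Run it with at least two processes, for example using mpirun with -np 2.

// Hedged sketch of a matched send/receive pair; not a book listing.
#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int size = 4, tag = 42;            // tag must match on both sides
    float buf[size] = {1.0f, 2.0f, 3.0f, 4.0f};

    if (rank == 0) {
        MPI_Send(buf, size, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);     // dest = 1
    }
    else if (rank == 1) {
        MPI_Status status;
        MPI_Recv(buf, size, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &status); // source = 0
        printf("rank 1 received %.1f ... %.1f\n", buf[0], buf[size-1]);
    }
    MPI_Finalize();
    return 0;
}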
cores of a multicore PC. In more complicated situations involving distributed processors a
configuration file can be specified containing the names and network addresses of the hosts to be used.
• Line 3: The output from line 20 of Example 9.10; notice that only output from the root process
appears here.
• Lines 4–11: The output from line 37 of Example 9.10; now there is one line of output for each
process, but the order in which the lines appear is effectively arbitrary. The value of check is only
correct for the process of rank zero. That is expected because only process root sets this value. All
processes report the correct value for fullsum.
42 MPI_Finalize(); // tidy up
43
42 return 0;
43 }
• Lines 1–9: A simple function to print a matrix of size nx × ny. The first argument is a title.
• Lines 13–17: Initialise MPI and set the number of processes and rank of current process in nproc
and rank.
• Line 18: Sets the value of the root process in root using optional user input.
• Lines 20–24: Allocate arrays to process an N × N matrix where N is set equal to the number of MPI
processes nproc. For each process, the vectors row and col will contain one row and one column
of this matrix. In addition, for the root process only, the vector mat will store the whole matrix. In
line 24 an empty vector mat is created for each process.
• Lines 26–29: For the root process only we create the full matrix by storing elements in root’s copy of
mat using the push_back member function of std::vector. The matrix is filled with integers
in the range 1 to N² and then printed. Non-root processes retain zero-sized versions of mat.
There are two ways to pre-allocate space in a std::vector:
std::vector<int> bigv(size);
or
std::vector<int> bigv; bigv.reserve(size);
The former will allocate memory and initialise all elements of bigv to zero. The latter will allocate
memory but not perform any initialisation; hence it might be faster.
• Line 30: MPI_Barrier is used here to ensure that all processes have been launched and are ready
for the subsequent MPI_Scatter operation in line 33. This is probably not strictly necessary, as
all the core MPI routines used in our examples are themselves locally blocking on their process.
However, unlike the CUDA case we might be synchronising across a network with heterogeneous
processors. If in doubt I recommend using lots of synchronisation calls while developing MPI
code – the unnecessary ones can be removed later in production code.
• Line 33: MPI_Scatter is used here to send blocks of N elements from root’s copy of mat to the
vector row in each process including the sending process root. At this point each process has one
row of the matrix in its row vector.
• Line 36: The MPI_Alltoall call used here is the key element of this example. For each process,
individual elements of its row vector are sent to an element of the col vector in every process.
Specifically, process p sends its element row[q] to the element col[p] in process q. This
results in the vectors col containing the columns of the original matrix.
• Line 39: This MPI_Gather copies the col vector of each process to a row of the original matrix mat
on the root process. This results in the original columns of mat now being its rows.
• Line 40: We print mat again to check the result. A minimal sketch of this scatter, alltoall and gather
sequence is shown below.
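The sketch below reproduces the sequence just described using the text's variable names; it is a simplified stand-in, not the book's Example 9.12 verbatim, and error checking and printing are omitted.

// Hedged sketch of the MPI matrix transpose core: scatter rows, alltoall
// single elements, then gather the resulting columns back on root.
#include <mpi.h>
#include <vector>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int nproc = 0, rank = 0, root = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int N = nproc;                           // one matrix row per process
    std::vector<int> row(N), col(N), mat;
    if (rank == root)                        // only root holds the full matrix
        for (int k = 0; k < N*N; k++) mat.push_back(k+1);

    MPI_Barrier(MPI_COMM_WORLD);             // all processes ready

    // each process receives one row of the matrix
    MPI_Scatter(mat.data(), N, MPI_INT, row.data(), N, MPI_INT,
                root, MPI_COMM_WORLD);

    // element row[q] of process p goes to element col[p] of process q
    MPI_Alltoall(row.data(), 1, MPI_INT, col.data(), 1, MPI_INT,
                 MPI_COMM_WORLD);

    // gather the columns back into mat on root: mat now holds the transpose
    MPI_Gather(col.data(), N, MPI_INT, mat.data(), N, MPI_INT,
               root, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}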
The results of compiling and running the program with eight processes are shown in
Example 9.13.
matrix
1 2 3 4 5 6 7 8
9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32
33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56
57 58 59 60 61 62 63 64
transpose
1 9 17 25 33 41 49 57
2 10 18 26 34 42 50 58
3 11 19 27 35 43 51 59
4 12 20 28 36 44 52 60
5 13 21 29 37 45 53 61
6 14 22 30 38 46 54 62
7 15 23 31 39 47 55 63
8 16 24 32 40 48 56 64
While the results of running mpialltoall are as expected, one could argue that using
eight processors to transpose an 8 × 8 matrix is overkill for such a small problem. However, there
is an interesting scaling feature here; we need N processors to handle N² data elements.
Moreover, although we used alltoall to transfer single ints between processors, we
could modify the code to transfer sub-matrices of a much larger matrix. Large problems in
linear algebra are indeed an important application area for MPI.
This has been a brief introduction to MPI, but hopefully it will help you get started with mixed
CUDA and MPI applications. In the next chapter we discuss tools for profiling and debugging
your code.
Endnotes Chapter 9
1 Actually, this is not quite true, the MPI_Send call uses a data buffer argument which is an array in the
calling process’s local memory. The call causes this buffer to be copied across the interconnect to a
corresponding buffer in the receiving processor’s local memory space. MPI_Send will return once that
data buffer in the calling process is no longer required and can be safely reused. An MPI implementation
may choose to copy the sender’s data buffer to another system buffer on the local host which would
allow a faster return as compared to time taken to directly transfer the data across an interconnect. In
neither case is there a guarantee that the reading process has actually got the intended data. There are a
number of alternative versions of MPI_Send which cater for various synchronisation needs. For
example, MPI_Ssend will block the sending process until the data really has reached the
receiving process.
2 See the NVIDIA blog “What is CUDA Aware MPI” https://devblogs.nvidia.com/introduction-cuda-
aware-mpi/
3 In the 1980s I had been involved in projects at CERN and elsewhere using multiple processors to speed-
up real-time event data processing. This involved chaining several processors into a pipeline through
which event data flowed and where each processor performed a different step in the processing chain.
Here one needed a different program for each processor and all the programs had to be tuned to take the
same amount of time. Hence here the programming complexity increased with every additional
processor added to the chain.
10
Tools for Profiling and Debugging
Sadly, not all newly written code works perfectly the first time it is run. Debugging and
performance tuning of code often takes a significant proportion of code development time.
Fortunately, the NVIDIA SDK provides many tools for profiling and debugging both host
and GPU code. Some of these tools, such as nvprof, have been around since early releases of
CUDA and some, such as Nsight Systems, are more recent. The tools used for Windows and
Linux systems are somewhat different in detail but in general give similar information. In
this chapter, our examples will mostly be taken from the Windows toolset as this is the
system used to develop the code for this book. Exactly the same methods can also be used
for Linux development; it is just that the user interfaces for some of the tools differ.
of terms.
○ Line 15: Sets termsum to the first term x.
○ Lines 16–20: This simple for loop adds the remaining terms to termsum using floating
point arithmetic.
• Lines 23–31: The kernel function gpu_log calls logsum for the required number of x values
using thread linear addressing.
○ Line 23: The input arguments are logs, a floating point array that holds the calculated log(1+x)
values, terms the required number of terms to be used when summing the power series, steps
the number of different uniformly spaced x values to be used and step_size the spacing
between x values. We expect the calculation time to be directly proportional to the product of
steps and terms.
95 _controlfp_s(nullptr,_DN_FLUSH,_MCW_DN);
99 hostint -= 0.5f*(logsum(0.0f,hterms)+logsum(1.0f,hterms))
*step_size;
100 ferr = 100.0f*(logint-hostint)/logint; // host error
101 printf("host int %f frac err %e %%\n",hostint,ferr);
102 double ratio = (thost*terms)/(tgpu*hterms); // speedup
103 printf("times gpu %.3f host %.3f gpujob %.3f ms
speedup %.1f\n", tgpu, thost, gpujob, ratio);
○ Line 25: Set tid to the rank of this thread in the grid-block using standard thread
linear addressing.
○ Lines 26 and 29: Standard loop where the threads share the work for a total of steps passes.
○ Lines 27–28: Calculate the current x value directly from tid; there is a tacit assumption here that
• Lines 62–67: Here the user can optionally set parameters for the job. These parameters are:
○ steps: This defines the number of x steps spanning the range [0,1]. The computation time should be
directly proportional to the number. This value is set to the power of 2 entered by the user in line 62.
○ blocks and threads: These set the CUDA grid configuration; our standard defaults of 256 blocks and 256 threads are used here.
○ hterms: This is the same as terms but controls the number of terms used by the host
calculation. Since it turns out that the host is much slower than the GPU, it is convenient to use
a separate value.
• Line 68: Uses malloc to allocate the host array logs to hold GPU calculated log values.
• Line 69: Declares and starts the timer gpuall intended to measure the time for all CUDA steps.
• Lines 70–72: Declare device arrays for the calculation. We choose to use cudaMalloc here rather
than thrust to avoid complications with the profiling and debugging tools. The arrays logs and
dev_logs are of size steps and will hold the full set of log values evaluated by the GPU. The
device array dev_sums has size of blocks and will hold the block sums produced by the first
reduce step in line 77. The variable dev_tot will hold the final sum of all log values from the
second reduce step in line 78.
• Line 73: This sets the step size required to span a unit range in steps steps. Notice we divide by
steps‒1 not steps; there is a good chance of a hard-to-spot, off-by-one error here if you get
this wrong.
• Lines 74 and 82: These lines, which bracket the CUDA calculation, define and use the host-based
timer tim. This measures the time to run the kernels and transfer results back to the host but
excludes time for cudaMalloc.
• Lines 75–81: This is the CUDA calculation code that we are interested in profiling and debugging.
○ Line 75: Calls the gpu_log kernel to fill dev_logs with steps values of log(1+x) based on the
power series.
○ Line 76: Copies the results back to the host array logs. Although only the single value
logs[steps‒1] is used in the later code, here we choose to copy the whole array – this
makes the profile more interesting.
○ Lines 77–78: Perform a 2-step summation of the values in dev_logs down to the single value
dev_tot.
○ Line 81: Finally uses cudaDeviceSynchronize to ensure all operations are complete before
the host timer is read in line 82. (A hedged sketch of the logsum function and gpu_log kernel follows.)
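The sketch below is consistent with the description above but is not the book's listing; the series summed is log(1+x) = x − x²/2 + x³/3 − ..., and the argument names follow the text.

// Hedged sketch of the logsum device function and gpu_log kernel.
__device__ float logsum(float x, int terms)
{
    float termsum = x;                        // first term of the series
    float xpow = x;
    for (int k = 2; k <= terms; k++) {        // add remaining terms of
        xpow *= -x;                           // log(1+x) = x - x^2/2 + x^3/3 - ...
        termsum += xpow/(float)k;
    }
    return termsum;
}

__global__ void gpu_log(float *logs, int terms, int steps, float step_size)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;       // thread-linear address
    for (int k = tid; k < steps; k += gridDim.x*blockDim.x) {
        float x = step_size*(float)k;                    // x value for this element
        logs[k] = logsum(x, terms);
    }
}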
Some results from running the code are shown in Example 10.2. One feature of these
results is a significant overhead of several hundred ms in addition to running the kernels in
the CUDA section of the code. This reflects the cost of the cudaMalloc calls in lines 70–72. To
understand what is going on here in more detail we need profiling.
(a) A small job with 2¹⁶ steps and 100 terms for both GPU and host. This is too small a calculation
to demonstrate the full power of the GPU.
D: >gpulog.exe 16 256 256 100 100
gpu log(2) 0.688172 frac err 7.178e-01%
gpu int 0.386245 frac err 1.269e-02%
host int 0.386245 frac err 1.268e-02%
times gpu 0.400 host 12.167 gpujob 177.555 ms speedup 30.4
(b) Same as (a) but using 2²⁴ steps. The accuracy of the results is unchanged but the GPU is now
about 100 times faster than the host.
D: >gpulog.exe 24 256 256 100 100
gpu log(2) 0.688172 frac err 7.178e-01%
gpu int 0.386245 frac err 1.267e-02%
host int 0.386245 frac err 1.267e-02%
times gpu 27.611 host 3160.105 gpujob 202.490 ms speedup 114.5
(c) The number of GPU terms is increased from 100 to 10⁵. Now the GPU is fully utilized and
outperforms the host by a factor of about 1600. The accuracy of the GPU calculation of log(2) and
the integral has also increased.
D: >gpulog.exe 24 256 256 100000 100
gpu log(2) 0.693134 frac err 1.874e-03%
gpu int 0.386294 frac err -8.701e-06%
host int 0.386245 frac err 1.267e-02%
times gpu 1861.515 host 3079.824 gpujob 2049.234 ms speedup 1654.5
The results of running the gpulog example with nvprof are shown as Example 10.3.
Accurate times are reported for each of the six steps in the CUDA pipeline ( lines 68–81 of
Example 10.1). For any particular function, nvprof reports the number of times it occurred in
the pipeline, the total time for all calls and the maximum and minimum times for particular
calls. Since the D2H cudaMemcpy operation is used twice (lines 76 and 80), it is easy to
infer that the first memcpy of 10,000 words took 26.232 ms and the second copy of a single
word took 704 ns. The remaining 4 steps each occurred once so their times are unambiguous.
These six times are shown in bold in Example 10.3. The sum of these times is 244.838 ms
which is in good agreement with the host reported GPU time of 245.732 ms. However, it
would have been quite tedious to have written host code to obtain the times for the individual
steps – nvprof makes this task simple. Another point to note is that the host reported time for
running gpujob is 655.452 ms, about 200 ms more than the time reported when running
without nvprof. The increase is due to the overhead of running the nvprof profiler.
One apparently puzzling feature of the profile is the value of 245.61 ms reported for the
2 API cudaMemcpy operations. These are the same operations as reported for the GPU but
apparently taking 10 times longer. This puzzle is solved by inspecting the actual timeline for
the operations involved, when it becomes clear that the GPU values are timed from the
beginning to the end of execution of each operation, whereas the API values are the time between
the operation being added to the CUDA stream queue and its completion.
Another feature to note is that both cudaMalloc and cudaDeviceReset are relatively
expensive operations for short jobs, taking 66 and 44 ms respectively.
For larger projects nvprof can generate a great deal of output unrelated to the features that
you wish to explore. A simple way to limit profiling to a block of code is to wrap the statements
cudaProfilerStart(); and cudaProfilerStop(); around that block. The header
file cuda_profiler_api.h contains the necessary definitions and must be added to the
include statements at the head of your code. Example 10.4 shows how to restrict profiling to
a few lines of the CUDA code. The more limited results of running the modified program are
also shown.
. . .
#include "cuda_profiler_api.h"
. . .
75 gpu_log<<<blocks,threads>>>(dev_logs,terms,steps,step_size);
76 cudaMemcpy(logs,dev_logs,steps*sizeof(float),
cudaMemcpyDeviceToHost);
cudaProfilerStart();
// 2 step reduce
77 reduce_warp_vl<<<blocks,threads >>>(dev_sums,dev_logs,steps);
78 reduce_warp_vlB<<< 1,threads >>>(dev_tot,dev_sums,
threads);
79 float gpuint = 0.0f;
80 cudaMemcpy(&gpuint,dev_tot,1*sizeof(float),
cudaMemcpyDeviceToHost);
81 cudaDeviceSynchronize();
cudaProfilerStop();
. . .
The header file cuda_profiler_api.h is actually quite limited; it just contains the two
start and stop functions used above.
The NVIDIA Tools Extension Library (NVTX) is a much more elaborate suite of
functions allowing you to customise the output for the visual profiler in many ways.
However, these tools are quite verbose and are beyond the scope of this book. More
information can be found in NVIDIA Profiler Users Guide https://docs.nvidia.com/pdf/
CUDA_Profiler_Users_Guide.pdf.
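For completeness, here is a small hedged taste of NVTX: pushing and popping a named range makes the enclosed work appear as a labelled bar in the profiler timeline. The nvtxRangePushA and nvtxRangePop functions and the nvToolsExt.h header are part of the NVTX C API; the surrounding code is illustrative, and the program must be linked against the nvToolsExt library.

// Minimal NVTX sketch: label a group of CUDA operations on the timeline.
#include "nvToolsExt.h"

void profile_reduce_phase()
{
    nvtxRangePushA("reduce phase");   // open a named range
    // ... kernel launches and cudaMemcpy calls grouped under this label ...
    nvtxRangePop();                   // close the range
}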
Running nvprof to generate a profile file and then launching NVVP to view the timeline.
As indicated the visual profiler can be launched from the command line specifying the
name of an input file; the timeline will then be displayed as shown in Figure 10.1. If no
arguments are specified, nvvp will still launch and you will be presented with a dialogue
enabling you to choose the exe file to run and to supply the command line arguments. The
program will then be automatically run to collect a profile for display. The visual profiler can
also be launched from a desktop shortcut in the usual way.
Figure 10.1 NVVP timelines for gpulog example: 100 ms per step
Figure 10.1 shows the nvvp output for the above job; we have zoomed the timeline view
a couple of steps using the control at C. The main window shows the timelines for all the
CUDA processes run in the job. The three most interesting sections are the runtime API (D),
the cudaMemcpy operations (E) and the kernels (F). It is possible to click on any of the
displayed tasks to view many details in the properties window (G). In the figure, the details
of the gpu_log kernel call (line 75 in Example 10.1 ) are shown and we can see that this
kernel call took 222.049 ms. The two reduce kernels which are much faster appear as vertical
bars on this view.
In the bottom left-hand corner, we can see the session settings window giving details of
the exe file and program arguments. Both of these can be changed manually and then the
session can be rerun using the control at B. I find this very convenient when exploring effects
of parameter changes. It is also possible to recompile the program, for example, with Visual
Studio, while the nvvp window is open and then use the control at B to see results with the
new version of the exe file.
The runtime API line (D) shows kernel and cudaMemcpy events starting when they are
added to the CUDA stream pipeline and finishing when they are complete. In this example
kernels, which are non-blocking on the host, are added at lines 75, 77 and 78 and cudaMemcpy
operation, which is blocking on the host, is added at line 80. Thus from the API point of view
the first cudaMemcpy operation begins at the same time as the first kernel. However, this
operation does not actually start until the first kernel has completed because operations on a
given CUDA stream are blocking with respect to each other on the GPU. This explains the
apparently long duration of this operation in the API section of the nvprof output. In more
complicated situations with multiple asynchronous CUDA streams the API information is a
useful aid for checking that the behaviour is as expected. Figures 10.2 and 10.3 show portions
of the same NVVP timelines but with the main window zoomed in further to show better detail
of the short duration reduce kernels and 4-byte cudaMemcpy operation.
In detail, Figures 10.1–10.3 show portions of the timelines for gpulog generated using
the NVIDIA Visual Profiler (NVVP) for Windows 10. Figure 10.1 shows about 0.9 seconds of
execution giving an overview of all steps. Interesting features are:
A: The executable program and command line arguments are entered in this panel.
B: Once set up in A the job can be run or rerun using the tool at B.
C: Timelines can be expanded in or out using the controls at C.
D–F: Timelines for runtime API, cudaMemcpy operations and kernels are shown separately
here. Each different kernel has a separate timeline.
G: Clicking on any object allows you to inspect its properties. Here we show the properties
for the gpu_log kernel including the run time of 222.048 ms. It is possible to zoom in
and then to inspect details of short duration events using the controls at C as illustrated in
Figures 10.2–3.
Figure 10.2 NVVP timelines for gpulog example: 100 µs per step
Figure 10.2 shows the NVVP timeline zoomed by a factor of 100 to show details of the
reduce kernels. The full width is now about 0.9 ms. The lower panels show details of the
launch and execution times of the reduce_warp_vl kernel. The reported execution times
for the two runs of this kernel are 158.98 µs and 1.568 µs respectively.
Figure 10.3 NVVP timelines for gpulog example: 2.5 µs per step
Figure 10.3 shows an even deeper zoom; the full width is now about 12.5 µs. The figure
shows detail of the reduce_warp_vlB kernel and the subsequent 4-byte cudaMemcpy
operation. This memcpy is reported to take 608 ns.
Both nvprof and NVVP are first generation NVIDIA products and work with early GPUs.
Recent GPUs of CC≥6.0 have extra hardware features to support profiling and in
2018 NVIDIA introduced a next generation pair of tools to better support these GPUs.
The new tools are Nsight Systems and Nsight Compute and these are discussed next.
sanet.st
Some of the other options shown in Figure 10.4 are primarily of interest to games
developers; these include DX11, DX12, OpenGL and Vulkan trace. The WDDM driver
trace is also mainly of interest for graphics applications. The NVTX option is of interest and
is discussed below. The ETW (Event Tracing for Windows) is somewhat specialised. More
details of all the tracing options can be found at the NVIDIA web page: https://docs.nvidia
.com/nsight-systems/tracing/index.html.
The results of running Nsight Systems with the options in Figure 10.4 are shown in
Figures 10.5 and 10.6.
Figure 10.5 shows the Nsight Systems timeline display. Many timelines are shown, of
which those at A, B and C are directly relevant to the gpulog CUDA job. The CUDA API
calls are shown at A, the kernel and cudaMemcpy calls at B and the device memory
allocation is shown at C. The timing detail of the events in B can be displayed in D by right
clicking as indicated. More details of individual items can be displayed at E. To zoom in to
the timeline, drag the cursor horizontally. Note the tool at F simply enlarges everything
rather than expanding the timeline. Figure 10.6 shows the timeline from Figure 10.5
expanded by a factor of ~6 × 10⁵ to show detail for the reduce kernels.
Comparing the results from NVVP and Nsight Systems we can see that the complete
timelines in Figures 10.1 and 10.5 contain similar information. This is also true for the
expanded timelines. Nsight Systems is a suitable replacement for NVVP. There is one
caveat to this: whereas nvprof is a good command line alternative to the GUI driven NVVP,
there is no equivalent command line interface for Nsight Systems on Windows. In Linux the
nsys command line tool is available as a command line driven profiler.5 Nsight Systems
will also work well with NVTX code annotations.
Figure 10.10 GPU Speed of Light: roofline plot for two kernels
Figure 10.9 shows the compute and memory use for the gpu_log (upper bar) and
reduce_warp_vl (lower bar) kernels. It is clear that the first kernel is compute bound
and the second kernel is memory bound.
Figure 10.10 shows the floating-point performance for the two kernels as points on a
“roofline” plot. Right clicking on a point gives a mini-summary box showing Flops per byte
and Flops per second. The gpu_log kernel achieves a compute performance of 2.84
TFlops/sec and 3.67 kFlops/byte whereas the reduce_warp_vl kernel achieves 107.8
GFlops/sec and 0.26 Flops/byte.
Figure 10.12 Memory workload analysis: flow chart for gpu_log kernel
Figure 10.12 shows the Nsight Compute memory workload analysis chart for the
gpu_log kernel. A flow of 64 MB from L1 cache to system memory can be seen.
10.6.4 Scheduler Statistics
On-screen description: “Summary of the activity of the schedulers issuing instructions. Each
scheduler maintains a pool of warps that it can issue instructions for. The upper bound of
warps in the pool (Theoretical Warps) is limited by the launch configuration. On every cycle
each scheduler checks the state of the allocated warps in the pool (Active Warps). Active
warps that are not stalled (Eligible Warps) are ready to issue their next instruction. From the
set of eligible warps the scheduler selects a single warp from which to issue one or more
instructions (Issued Warp). On cycles with no eligible warps, the issue slot is skipped and no
instruction is issued. Having many skipped issue slots indicates poor latency hiding.”
Figure 10.13 shows statistics for both gpu_log and reduce_warp_vl. From the Slot
Utilisation issues boxes we see the warp scheduler issues one instruction every 1.7 cycles for
the gpu_log kernel and one every 22.9 cycles for the reduce_warp_vl kernel. The
second kernel thus has very poor memory latency hiding.
Figure 10.14 Warp state statistics: showing data for two kernels
When executing a kernel with mixed library and user code, these metrics show the combined
values.”
The detailed information in Figure 10.14 includes recommendations which might help
experienced GPU programmers improve the design of their code.
This section, which has no additional charts, reports the launch configuration and a few
derived quantities. For this example, we get the values shown in the box.
Table 10.1 Tuning the number of thread blocks for the gpulog program

                         gpu_log                   reduce_warp_vl
Blocks    Waves    TFlops/sec    time ms     TFlops/sec    time µs
256       1.78     2.847         235.73      0.107         160
512       3.56     3.067         218.79      0.110         160
1024      7.11     3.143         213.47      0.114         163
288       2        3.161         212.31      0.108         160
576       4        3.193         210.16      0.101         176
1152      8        3.190         210.35      0.117         160
The value reported for threads, which is simply the product of the grid size and block size,
is as expected, and the number of registers per thread has been decided by the compiler and is
safely below 32, the limit for full occupancy. However, the value for waves is interesting and
helpful. In CUDA a wave of threads refers to the set of all threads launched on all SMs at a
given time. For full occupancy this is either 2048 or 1024 times the number of SMs on the
GPU depending on the CC level of the GPU. For current GPUs with CCs in the range 3.5 to
8.0 only Turing GPUs with CC=7.5 have SMs which support up to 1024 resident threads; for
all others the maximum is 2048.
The Turing RTX 2070 GPU used here has 36 SMs so the wave size is 1024 × 36 =
36,864 threads and the launch configuration corresponds to 65,536/36,864 = 1.78 waves. This
means two waves will be launched to process the job; the first wave achieves full occupancy,
but the second wave only achieves 78% occupancy. This may or may not affect performance
depending on the kernels. Nsight Compute allows you to easily experiment with this and
some results for the gpulog program are shown in Table 10.1.
As is clear from the table, using a launch configuration with an integer number of waves is
significantly better than our first choice for the gpu_log kernel. Using exactly four waves
gives a performance of 3.19 TFlops/sec, a gain of 12%. A final wave with less than full
occupancy is sometimes referred to as the “tail” by CUDA programmers, and while I was
always aware that such a tail exists when using a standard launch configuration like 256 × 256,
I had assumed that any effect on compute-bound kernels would be small. After all, a single GPU
SM unit only has enough hardware to process two or four warps at a given instant; the
advantage of full occupancy is latency hiding. It would appear that full occupancy is hiding
more than just the latency of external memory access. We have experimented with other
kernels used in this book and find that using an integer number of waves for the number of
thread blocks instead of a power of 2 does not always make a significant difference. We
recommend experimenting with this for each of your GPU projects.
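When experimenting along these lines it can help to compute the wave size of the installed GPU at run time. The following hedged sketch uses the device properties to do so; the numbers quoted above for the RTX 2070 drop out of the same arithmetic.

// Hedged sketch: compute the wave size and number of waves for a launch
// configuration from the device properties.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocks = 256, threads = 256;              // launch configuration
    int wave = prop.multiProcessorCount *         // threads resident per wave:
               prop.maxThreadsPerMultiProcessor;  // 1024 or 2048 per SM
    double waves = (double)(blocks*threads)/wave;

    printf("SMs %d wave size %d launch %d threads = %.2f waves\n",
           prop.multiProcessorCount, wave, blocks*threads, waves);
    return 0;
}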
10.6.8 Occupancy
On-screen description: “Occupancy is the ratio of the number of active warps per multipro-
cessor to the maximum number of possible active warps. Another way to view occupancy is
the percentage of the hardware’s ability to process warps that is actively in use. Higher
occupancy does not always result in higher performance, however, low occupancy always
reduces the ability to hide latencies, resulting in overall performance degradation. Large
346 Tools for Profiling and Debugging
discrepancies between the theoretical and the achieved occupancy during execution typic-
ally indicates highly imbalanced workloads.”
This section gives the achieved kernel occupancy and has charts showing how this might
vary with registers per thread and shared memory use. Figure 10.16 shows the chart produced
for the gpulog program.
Figure 10.16 Occupancy: theoretical and achieved values for gpulog program
Figure 10.17 Source counters: source and SASS code for gpulog program
In summary, Nsight Systems and Nsight Compute are really powerful tools for investi-
gating the performance of your GPU code. We highly recommend their use in situations
where you suspect your code is not performing as well as it might.
Since fast code is no use if it does not actually work, it is now time to consider debugging.
While printf-style debugging can be done with code compiled in the optimised release mode, the other more
interactive methods require the code to be compiled in debug mode.
30 checkCudaErrors( cudaSetDevice(0) );
40 cx::ok( cudaSetDevice(0) );
50 kernel<<<blocks,threads>>>(a,b,c);
51 cx::ok( cudaGetLastError() ); // check for launch error
52 cudaDeviceSynchronize();
53 cx::ok( cudaGetLastError() ); // check for run time error
returned by the CUDA function cudaSetDevice(). This function will return an error code if
there is no CUDA GPU available; hence it is good practice to check for this error.
○ Line 11: Here we check cudaStatus, if it is equal to cudaSuccess then there is a CUDA
device present and we can bypass the if clause and proceed with the rest of the program.
○ Line 12: An error has occurred so print a warning message.
○ Line 13: We cannot continue after this error so the code uses a goto statement to jump to a shared
fragment of code that performs a tidy-up and then exits. The use of goto in this way dates from the
FORTRAN II era and should never be used in modern software. It is always possible to avoid
goto, for example, by encapsulating sections of code in a sub-function and returning with an error
code. The routines in cxbinio.h are an example of this.
• Lines 20–24: This is a slightly improved version of the above. We have used
cudaGetErrorString in the printf on line 22 to convert the value of cudaStatus to a
readable description of the error. We have also replaced the go to with exit(1) which ends the
program and returns an error code of 1 to the operating system. For simple programs we can rely on
the operating system to clean up allocated resources in the case of fatal errors, but this should not be
relied on for high-quality production code. Using exit in this way may also prevent profiling or
debugging tools such as Nsight Compute from working properly.
• Line 30: This is a more compact version of lines 20–24. The CUDA SDK uses the function
checkCudaErrors for checking return codes in many of the SDK examples. It performs the
same task as lines 20–24 including exiting the program if an error is found. The function is defined
in helper_cuda.h.
• Line 40: This is similar to using checkCudaErrors except that cx::ok, as defined in cx.h, is
used; hence it is easy for you to edit and change the exit(1) to return 1 for situations where
you may need to continue after an error has been detected. It is also slightly less verbose, improving
the readability of your kernel code.
• Lines 50–53: These lines show how to use the same method with kernel code, given that a kernel
cannot directly return an error code to the caller. However, the CUDA system maintains a flag
containing the last reported runtime error from either host or device code. The flag can be read using
the cudaGetLastError function as shown in this fragment of code.
○ Line 50: Here we launch a kernel and want to check for errors.
○ Line 51: Here we check for kernel errors using cx::ok with cudaGetLastError – this will
now behave just like the code in lines 30 or 40 albeit with one subtle difference. The kernel launch
is asynchronous with the host execution so probably this call will not pick up any errors that occur
during kernel execution, except possibly at the very start. What it will do is to pick up any errors
from the kernel launch itself, for example, insufficient shared memory or a thread block size which
is too large. A side effect of calling cudaGetLastError is to reset the internal error to
cudaSuccess so repeated calls will only detect a given error once. (There is also a
cudaPeekAtLastError which returns the most recent error without resetting the error state.)
○ Line 52: The cudaDeviceSynchronize call now suspends host execution until the kernel
launched in line 50 has completed. After this call the last of any errors reported by the kernel will
be available to check.
○ Line 53: Here we check for errors during kernel execution. A stand-in sketch of such a check macro follows.
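The definition of cx::ok lives in the book's cx.h header and is not reproduced in this extract; a minimal stand-in macro with the same flavour, shown here purely for illustration, might look like this.

// Hedged stand-in for an error-checking helper; not the book's cx::ok.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                               \
    do {                                                               \
        cudaError_t status = (call);                                   \
        if (status != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(status), __FILE__, __LINE__);   \
            exit(1);                                                   \
        }                                                              \
    } while (0)

// usage mirroring lines 50-53 above:
//   kernel<<<blocks,threads>>>(a,b,c);
//   CHECK_CUDA( cudaGetLastError() );   // launch errors
//   cudaDeviceSynchronize();
//   CHECK_CUDA( cudaGetLastError() );   // run-time errors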
To debug in Visual Studio, all that is needed is to select debug mode compilation, set one or
more break points and then press F5. This is illustrated in Figure 10.18 which also explains
how to set the command line arguments and make the desired project the target for debugging.
The procedure for starting a debugging session in Visual Studio as shown in Figure 10.18
is as follows:
1. Select debug mode at A.
2. Set one or more break points by clicking to the left of appropriate statement line numbers.
The selected statements are shown by markers indicated by B for lines 116 and 134 in
the figure.
3. Press F5 to start the session.
4. If you have more than one project in your solution then, before starting debugging, the
project concerned must be set as the default startup project by right clicking on the project
name at C. The default startup project name is shown in bold.
5. Right clicking on the project name also allows you to open the project properties pages as
shown at D.
6. The debugging tab at E is where you set the command line arguments for the program
being debugged as shown at F.
Once the debugging session has started, the program will run as far as the first break point
and then wait for user input. At this point the debug menu option at G in Figure 10.18 will be
populated with many options as shown in Figure 10.19.
Interesting commands are F10 and F11 shown at A; these run the next line in your code
with the option to step into or over called functions. Note that a called function is still
executed even if stepped over and this applies to CUDA functions as well as host functions.
You can also open the Locals window at B which displays the current values of local
variables. These values are dynamically updated as you step through your code, with newly
changed values being highlighted in red. The current value of a variable can also be
inspected by hovering over that variable with the mouse pointer. This is obviously very
powerful and an alternative to using printf.
Pressing F5 shown at D will cause the program to run on until the next breakpoint is
reached (this could be at the same place in the code, if inside a loop). The line number
corresponding to the current break point is shown at C.
A more in-depth treatment of the Visual Studio debugger can readily be found on the web,
for example, https://docs.microsoft.com/en-us/visualstudio/debugger/getting-started-with-
the-debugger-cpp?view=vs-2019.
Figure 10.20 shows the result of stepping through the gpulog code up to the second
break point at line 134 (at A). The Locals window now displays values of variables
calculated up to this point. We can see at B that both the log2 and log2_gpu variables
have correct values indicating that the GPU code has run successfully by this point. The
host array logs will also have been filled by the first cudaMemcpy. The contents of large
arrays can be inspected by opening a Watch window using the menu option at B in
Figure 10.18 and then typing in the array element to view; this is illustrated at C in
Figure 10.19. Note that simple index arithmetic using valid variables is allowed, for example,
logs[steps/2] is fine.
break point, however once in a kernel you can then use F11 to step through kernel code just
like host code.
A debugging session for our gpulog example is shown in Figure 10.22. In this example
the program has been run up to the break point at A and then F11 has been used to step
through the kernel code up to line 24 in the logsum __device__ function. We have
actually then stepped through the for loop a few iterations so that k is equal to 4 at the point
shown in the figure. The current values of k and other variables are shown in the Locals
window at C. Note that the CUDA built in variables are included here and we are looking at
thread 99 in block 0. By default, the CUDA debugger will show values for thread 0 in block
0 but this can be changed by using the CUDA Warp Info and Lanes windows which are
shown at D and E. To select a particular thread one can just click on an element of the matrix
shown at F in the Warp Info window. The selected thread will have a different colour as
can be seen for thread 99 near F in our figure. Alternatively, one can select a particular warp
and lane by clicking in the left-hand columns in the Warp Info and Lanes windows. The
selected rows are indicated by the arrows at D and E.
More information on CUDA debugging with Visual Studio can be found on the NVIDIA
website, for example, at https://docs.nvidia.com/nsight-visual-studio-edition/index.html.
10.10.1 Cuda-memcheck
Another useful tool is the NVIDIA CUDA memory checker cuda-memcheck.exe which
can help diagnose hard-to-find array-addressing errors in kernel code. Such errors often
manifest themselves by crashing the program. Sometimes the cx::ok return code checker
will also pick up these errors but that is not always true and cuda-memcheck provides
better diagnostics. The cuda-memcheck program is like nvprof in that it is a command
line wrapper for running CUDA programs. As an example, let us introduce a bug in our
Example 10.1 gpulog code as shown in Example 10.6 below. The new kernel mem_put
writes to one element of the input array and the host code uses the user-defined parameter
loc to specify which element to overwrite.
The results of testing different values of loc are also shown in Example 10.6. For case
(a), the code is run without any checks; no error is reported for any value of loc up to 2³¹‒1.
Correct results are still obtained but the execution time for the GPU code has increased. In
case (b) we use conventional error checking with the cx::ok wrapper around the
cudaDeviceSynchronize in line 81. In this case we do detect errors, but only when
loc is about a factor of 8 larger than the valid index range. The nature of the error is
correctly reported but the offending kernel is not identified. The program is terminated when
the first error is detected.6 In case (c), using cuda-memcheck, an error is found when loc ≥
steps+blocks; the offending kernel is identified and execution continues.
Note that in case (c) using cuda-memcheck we only detect errors when loc exceeds
steps+blocks-1; the offending kernel is then correctly identified and program execution
is allowed to continue. Even cuda-memcheck is not perfect for finding out-of-range index
errors. We conjecture that in our example the two large GPU memory allocations for
dev_logs and dev_sums made in lines 70 and 71 of Example 10.1 are actually allocated
as contiguous blocks with dev_sums after dev_logs. Thus, it is only when the index
used in dev_logs exceeds the combined size of both arrays that an out-of-range error is
flagged by the hardware.7
Note that GPU execution times are always increased when using cuda-memcheck,
irrespective of whether an error is detected. This is not really a problem, as no code
changes are necessary when using this tool and you would not use cuda-memcheck for
production runs of code.
Endnotes Chapter 10
1 This final cudaDeviceSynchronize is not strictly necessary as the previous cudaMemcpy (D2H)
is always blocking on the host. (But note that H2D copies with data sizes of less than 64K might not be
blocking on the host.)
2 On Linux you may be able to use the compiler switch -ftz or on Windows with the Intel C++
compiler the /Qftz switch.
3 You may get a missing dll error message when you first try this, e.g. missing cupti64_2020.1.1.dll.
On Windows this can be fixed by adding %CUDA_PATH%/extras/CUPTI/lib64 to the system
PATH.
4 You may need a (free) developer account to access these pages – you should not have a problem with
this; just follow the instructions to register for one.
5 It seems that differences in CUDA support tools for Windows and Linux are emerging, with the Windows
tools more focused on gaming and graphics while the Linux tools are more focused on HPC.
6 This behaviour can be modified by editing the definition of cx::ok in the cx.h include file.
7 If we are being picky, then there is also a third allocation of just one word in line 72 of Example 10.1 (b).
Maybe that allocation is not contiguous and so does not contribute to the size of the larger contiguous block.
We can actually print the GPU pointers (using %lld) to check this and indeed find that whereas the first
two allocations are contiguous the third allocation is offset by an additional 96 bytes. This in turn is
because each cudaMalloc array is aligned on a 512 byte boundary and the size of the dev_sums array for
288 blocks is not a multiple of 512.
11
Tensor Cores
The Volta CC 7.0 generation of GPUs, launched in June 2017, included new Tensor Core
hardware to accelerate matrix multiplication. The capabilities of this hardware were
increased in June 2020 with the launch of the Ampere GPU generation. Although intended
specifically for applications in artificial intelligence, we feel this hardware may be more
generally useful and is well worth looking at even if AI is not your primary focus.
11.1 Tensor Cores and FP16
The Turing generation of GPUs with CC = 7.5 was launched after Volta, in August
2018, and has tensor cores upgraded to include processing of 4-bit and 1-bit integers. In
June 2020, NVIDIA launched its CC 8.0 Ampere generation of GPUs which include
upgraded third generation tensor cores. These are a factor of 2 or more faster and support
two additional short floating-point types, BF16 and TF32. Both new types have an 8-bit
exponent, the same as used in FP32 and either a 7-bit or a 10-bit fraction. The hardware
layout of all these floating-point numbers is shown in Figure 11.1. The BF16 format is a
variant of the IEEE standard FP16 format allowing more dynamic range at the expense of
reduced precision. The TF32 format is a variant of the IEEE FP32 format with the same
exponent range but with fewer fraction bits. Importantly, standard FP32 values can be used
as inputs for the a and b matrices and will be internally converted to TF32 values. This
greatly simplifies porting existing matrix code to tensor cores because format conversions
on the host are not required. Ampere also has other enhancements closely targeted at
machine learning or AI applications; we recommend the interested reader to look at the
NVIDIA nvidia-ampere-architecture-whitepaper document which can be found in PDF
form at www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architec
ture-whitepaper.pdf.
In Figure 11.1 the FP64, FP32 and FP16 formats are standard IEEE formats, but TF32 and BF16
are not. Only FP16 is supported on first generation CC = 7 GPUs (Volta
and Turing) while all formats are supported on third generation CC = 8 (Ampere) GPUs.
Although FP32 and FP64 can be used for the a and b matrices on third generation tensor
cores there is no performance gain.
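On CC 7 devices the a and b matrices must therefore be supplied as FP16, so host data typically has to be converted first. The snippet below is only a minimal sketch of such a conversion (the function and array names are arbitrary; it assumes the file is compiled with nvcc so that cuda_fp16.h supplies the host-side conversion helper):

    #include <cuda_fp16.h>
    // convert a float array to half precision before copying it to the GPU
    void to_half(const float *src, half *dst, int n)
    {
        for (int k = 0; k < n; k++) dst[k] = __float2half(src[k]);  // round to nearest FP16
    }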
Summary of the warp matrix functions defined in mma.h:

template<typename Use, int m, int n, int k, typename T, typename Layout=void> class fragment;
   Creates objects of type fragment. The parameter Use must be one of the keywords matrix_a,
   matrix_b or accumulator. Layout is one of row_major or col_major.

void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned ldm, layout_t layout);
   Load a 16 × 16 tile of data into fragment a from the array at mptr using a stride of ldm between
   rows or columns depending on layout. This version is for accumulator fragments.

void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned ldm);
   As above but for matrix type fragments with known layouts.

void fill_fragment(fragment<...> &a, const T& v);
   Set all elements of fragment a to the value v.

void mma_sync(fragment<...> &d, const fragment<...> &a, const fragment<...> &b, const fragment<...> &c, bool satf=false);
   This evaluates D = A*B + C for the fragments. If satf is true, overflows are set to max_norm and
   NaN is replaced by 0.

void store_matrix_sync(T* mptr, const fragment<...> &a, unsigned ldm, layout_t layout);
   The fragment a is written to a 16 × 16 tile in the array at mptr using a stride of ldm.

operator[], e.g. for(k=0; k<frag_a.num_elements; k++) frag_a[k] = p*frag_a[k] + q;
   Elements of fragments can be accessed using standard array index notation, but fragments of
   different types cannot be mixed in expressions. All threads in the warp must perform the same
   for loop operation.

Note these functions must be called with the same arguments by all threads in a warp, which implicitly
share the same set of fragments.
The warp matrix functions operate on fragment objects of sizes given by three integers m, n and k. NVIDIA's convention is that if c
and d are m × n matrices, then a is m × k and b is k × n. The warp matrix functions in
fact support three geometries {m,n,k} = {16,16,16} or {32,8,16} or {8,32,16}. In
other words, k must always be 16 but m and n can either both be 16 or one can be 8 while
the other is 32. The include file mma.h defines the warp matrix functions and a fragment
class. Functions are provided to fill, multiply and store fragments.
Some functions take an optional last argument layout. If specified this must be one of
the keywords row_major or col_major. Layout is specified when declaring fragments
for the a and b matrices but not for c and d. However, layout is necessary when c or d are
used with the load_matrix_sync or store_matrix_sync functions.
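For instance, a set of fragment declarations for the 16 × 16 × 16 geometry might look like the following sketch; the layout choices here are illustrative, not necessarily those used in Example 11.1:

    using namespace nvcuda;  // the wmma functions live in the nvcuda namespace (mma.h)
    // inside a kernel:
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);  // clear the accumulator before the first mma_sync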
Example 11.1 shows how to perform matrix multiplication using tensor cores. It is similar
to our earlier Example 2.13 in that both use tiled matrix multiplication. The difference is that
in 2.13 each thread in a thread block evaluates one element of a tile whereas in Example 11.1
one warp evaluates one tile; thus, here a thread block of 256 threads evaluates eight 16 × 16
tiles instead of one. In Example 2.13 we used one thread block for each tile in the output
matrix c whereas in Example 11.1 we use warp-linear addressing to span all the tiles in c.
The code assumes that the matrix dimensions are multiples of 16 so that edge effects can be
ignored. The tile sizes for the tensor core calculations are set explicitly to 16 × 16 rather than
being parameterised using NVIDIA's m, k and n parameters to help with the clarity of
the code.
Example 11.1 matmulT kernel for matrix multiplication with tensor cores
05 #include "mma.h"
06 using namespace nvcuda;
. . .
10 __global__ void matmulT(r_Ptr<float> C, cr_Ptr<half> A,
cr_Ptr<half> B, int Ay, int Ax, int Bx)
11 { // warp rank in grid
12 int warp = (blockDim.x*blockIdx.x+threadIdx.x)/warpSize;
13   int cx = warp%(Bx/16);  // (x,y) location of active tile
14 int cy = warp/(Bx/16); // for current warp in C matrix
○ Line 22: Store the next a tile in a_frag; the values are read from the argument A, which points to
global GPU memory. Note the entire warp participates in this operation and control passes to the
next statement only when all threads in the warp are ready.
○ Line 23: Like the previous line but stores the next b tile in b_frag.
○ Line 24: This is where the tensor cores perform matrix multiplication on the 16 × 16 a and b tiles,
accumulating the result into c_frag. The following lines then advance the tile pointers along a row of a and down a
column of b.
• Line 29: At the end of the process we copy the result in c_frag to the output array c. Note all
threads in the warp specify the same address in c.
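The middle of the kernel is not reproduced above, so the fragment below is only a hedged sketch of what the central loop might look like, written to be consistent with the bullet points and with the shared-memory version in Example 11.2 (variable names such as Atile_pos and Btile_pos are taken from that example; details may differ from the book's actual listing):

    wmma::fragment<wmma::matrix_a,16,16,16,half,wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b,16,16,16,half,wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator,16,16,16,float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    int Atile_pos = cy*16*Ax;   // top-left of the first A tile for this warp
    int Btile_pos = cx*16;      // top-left of the first B tile for this warp
    for(int k=0; k<Ax/16; k++){
        wmma::load_matrix_sync(a_frag, &A[Atile_pos], Ax);  // next 16 x 16 A tile
        wmma::load_matrix_sync(b_frag, &B[Btile_pos], Bx);  // next 16 x 16 B tile
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);     // accumulate C = A*B + C
        Atile_pos += 16;      // move along the A row
        Btile_pos += 16*Bx;   // move down the B column
    }
    wmma::store_matrix_sync(&C[(cy*Bx+cx)*16], c_frag, Bx, wmma::mem_row_major);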
The result shown in the figure for 1024 × 1024 matrices shows a speed-up of about a
factor of 4 compared to the standard GPU version in Example 2.13 and a factor of over
8000 compared to using unoptimised code with three for loops on a single host CPU. The
results agree to 4 significant figures for matrices filled with random numbers taken from a
uniform distribution between 0 and 1. The performance of Example 11.1 is about 3.8
TFlops. This excellent performance can be improved further by noting that the a and b tiles
read from GPU memory in lines 23–24 are only used to evaluate a single c tile and are
actually reread Ax times by different warps. We can use shared memory to hold a single a
tile which can then be used by the 8 warps in a single thread block of size 256 to multiply
8 different b tiles. This version is shown in Example 11.2 where we also buffer the 8 b tiles
in shared memory.3 This code uses thread blocks of size 256 and firstly all the threads in a
thread block cooperatively set up one tile from the matrix a in shared memory and then
secondly each of the eight warps separately loads a different tile from the b matrix into the
shared memory.
Example 11.2 matmulTS kernel for matrix multiplication with tensor cores and
shared memory
31 for(int k=0;k<Ax/16;k++){
32 as[idy*16+idx] = A[Atile_pos+idy*Ax+idx]; // 256 threads here
33 __syncthreads();
34 for(int p=0;p<8;p++) bs[wb][p*32+tyw*16+txw] =
B[p*2*Bx+Btile_pos+tyw*Bx+txw]; // 32 threads fill tile
35 __syncwarp();
36   wmma::load_matrix_sync(a_frag,&as[0],16);    // load A as 16x16 tile
37   wmma::load_matrix_sync(b_frag,&bs[wb][0],16); // load B as 16x16 tile
38 wmma::mma_sync(c_frag,a_frag,b_frag,c_frag); // C = A*B + C
39 Atile_pos += 16; // move along A row
40 Btile_pos += 16*Bx; // move down B cols
41 }
42 wmma::store_matrix_sync(&C[(cy*Bx+cx)*16], c_frag,
Bx,wmma::mem_row_major);
43 }
• Lines 20–25: These lines have been added and are used to organise the loading of the shared
memories.
○ Line 20: The variable wb is set to the rank of the current warp in the thread block. For
256 threads wb will be in the range [0,7] and is used as the leftmost index of the shared
array bs.
○ Line 21: The variable trw is set to the current thread's rank in its warp.
○ Lines 22–23: The 32 warp-local variables txw and tyw are set in the ranges 0–15 and 0–1
respectively. These variables are used to address rows of the 16 × 16 b tiles in line 34.
○ Lines 24–25: The 256 block-wide variables idx and idy are both set in the range 0–15 and are
used to address the a tile stored in shared memory in line 32.
○ Line 34: Here each warp copies its own b tile into bs[wb], where wb is the rank of the current
warp in the thread block. Since now there are only 32 cooperating threads an 8-pass loop
is required.
○ Line 35: This __syncwarp is necessary before using the contents of bs[wb].
○ Lines 36–40: These lines are almost identical to lines 23–27 of the previous example. We use the
warp matrix functions to load a and b tiles to a_frag and b_frag and then accumulate their
product to c_frag. The only change is that in lines 36 and 37 we load a_frag and b_frag
from shared memory instead of device global memory. Note the strides are also changed to 16 in
these lines.
• Line 42: Copy the completed c tile to global memory; this line is the same as line 29 of the
previous example.
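For reference, the shared-memory declarations assumed by lines 31–42 would be something like the following sketch; the sizes follow the description above, although the exact declarations in the book's listing may differ:

    __shared__ half as[256];      // one 16 x 16 A tile shared by all 8 warps in the block
    __shared__ half bs[8][256];   // one 16 x 16 B tile per warp, indexed by the warp rank wb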
The performance of our matmulTS kernel at around 6 TFlops is approaching the 8.8
TFlops we found in Chapter 2 using the optimised cuBlas library routines. The matmulTS
kernel uses a fixed thread block size of 256, but this can easily be changed to 512 or 1024
which would allow the a tile to be shared by 16 or 32 warps. In fact, the number of warps
can be made a template parameter. We have experimented with this and find that on our
platform eight warps gives the best performance. Performance could be further improved if
we find a way of sharing the b tiles as well as a tiles between warps. The CUDA SDK 11.0
example cudaTensorCoreGemm does just this and achieves around 15 TFlops on my
platform. However, that code is more complicated than our example.
Table 11.2 Tensor cores supported data formats and tile dimensions
For a description of the specialised float types bf16 and tf32 and the sub-byte types u4, s4 and
b1, see the most recent CUDA C++ Programming Guide.
Thus, to partially sum a large set of numbers, we can modify the matrix multiplication in
Example 11.1 by setting all elements of fragment a to 1 and stream sets of 256 numbers to
fragment b for each warp and accumulate the column sums in c. At the end of the process,
we simply need to sum the 16 values along a row of the accumulator c to get that warp’s
contribution to the sum. Note we can choose any row as all the 16 rows of c will contain the
same column sums. Of course, we have done 16 times more work than is strictly necessary
for reduction, so this would be a crazy method without very fast matrix multiplication. The
resulting code is shown in Example 11.3.
Description of Example 11.3
• Line 10: The declaration of reduceT uses the same arguments as the previous reduce kernels
discussed in Chapters 2 and 3. The kernel argument sums is used for the results and is of size equal
to the number of thread blocks; it will hold the partial sums from those thread blocks on exit. The
array data holds the values to be summed and contains n elements specified by the third argument.
A significant difference is that here data is declared as half whereas previous reduction kernels use
float.
• Line 12: A 2D shared memory array fs is declared to hold one 256-word tile of floats for each warp.
Dynamic shared memory is used here so that the thread block size can be varied for tuning purposes.
• Lines 13–15: We use cooperative groups in this kernel as the warp_shfl function is used in lines
36–39.
• Lines 16–18: Here we find warp and thread ranks to use in the subsequent calculation.
• Lines 19–20: The variables wpoint and wstep are initialised here and are used to implement
warp-linear addressing of the array data in chunks of 256 words in lines 28–32.
• Lines 22–25: Here the a, b and c fragments to be used in the calculation are declared as in Examples
11.1 and 11.2.
• Line 26: This line differs from before as we now fill a_frag with the value 1 instead of with
matrix data.
• Lines 28–32: This loop implements warp linear addressing to step through the input array data with
each warp processing a chunk of 256 words on each step through the loop. There is a tacit
assumption that the array size n is a multiple of 256.
○ Line 29: Loads the next chunk of data into b_frag; note we use a stride of 16 here so that 256 contiguous values are loaded as a 16 × 16 tile.
○ Line 31: Increments the data pointer wpoint by the amount of data used by all warps in this pass.
• Line 33: The final accumulated values of the column sums are copied to the shared memory 2D
array fs here. The first index wb is the rank of the current warp in the thread block.
• Lines 35–39: Here we perform warp-level reduction on the copy of the first 16 elements (i.e. the first
row) of each warp’s c_frag array. The method is the same as used in Chapter 3. We note that on
GPUs with CC ≥ 8 we could effectively use the new warp-level reduce function to replace
these lines.
• Line 40: Here the warps add their contributions to elements of the output array sums.
The timing results shown at the end of the example are for the reduction of 2²⁸ values. A single calculation
was used for the host CPU timing and 1000 repeated calculations were used for the reduceT and
comparison reduce7_vl kernels.
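Putting the description together, a compressed sketch of the central part of the kernel might look like the fragment below. This is a hedged reconstruction, not the book's exact Example 11.3; the names (a_frag, b_frag, c_frag, data, wpoint, wstep, fs, wb) follow the description above, and the fragments are assumed to have been declared as in Examples 11.1 and 11.2:

    wmma::fill_fragment(a_frag, __float2half(1.0f));       // an all-ones a matrix turns the multiply into column sums
    wmma::fill_fragment(c_frag, 0.0f);                     // clear the accumulator
    while (wpoint < n) {                                   // warp-linear loop over 256-value chunks
        wmma::load_matrix_sync(b_frag, &data[wpoint], 16); // 256 half values loaded as a 16 x 16 tile
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // accumulate this chunk's column sums
        wpoint += wstep;                                   // skip over the chunks handled by other warps
    }
    wmma::store_matrix_sync(&fs[wb*256], c_frag, 16, wmma::mem_row_major);  // any row now holds the 16 partial sums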
Another interesting detail in Example 11.3 is that in line 29 we are reading a 512-byte
chunk of data from GPU main memory. In this case neither using shared memory nor using
the vector-loading technique improves performance. This is because the load_matrix_
sync call is performed cooperatively by all threads in the warp and is designed to load this
size chunk of data very efficiently. Note that because the loaded data is only used by one
warp, shared memory is not helpful.
The new reduceT kernel runs twice as fast as our previous best reduce7_vl kernel.
This improvement is in fact due to using type half instead of float for the input array
data, which halves the number of bytes that we have to read from GPU main memory. This
is not a trivial observation; NVIDIA's hardware support for the new type exists because it is used
by tensor cores and AI applications where the limited precision is usually adequate. This
suggests we could just use the method from Chapter 3 with the new half data type and get a
similar speed-up. Example 11.4 shows the resulting reduce7_vl_half kernel.
Example 11.4 reduce_half_vl kernel for reduction using the FP16 data type
17 float v = 0.0f;
18 half v8[8] = {(half)0.0f,(half)0.0f,(half)0.0f,
(half)0.0f};
19 for(int tid = grid.thread_rank(); tid < n/8;
tid += grid.size()) {
20 reinterpret_cast<int4 *>(v8)[0] =
reinterpret_cast<const int4 *>(data)[tid];
• Lines 10–16: The only change from Example 3.7 is that the input data array data is declared as
half not float.
• Line 17: Here we declare float v which will be used by a thread to accumulate its contribution to
the overall sum. In Example 3.7, this was the float4 variable v4 which has a size of 16 bytes (128 bits),
suitable for optimal vector loading and with support for vector arithmetic operations.
• Line 18: Declare v8 as an 8-element vector of type half for each thread. This will play the same role
as v4 in Example 3.7 but does not have overloaded operators to support arithmetic operations.
• Lines 19–22: This for loop accumulates contributions for each thread using vector loading to load
eight elements of data in one step.
○ Line 20: We use reinterpret_casts to the int4 data type to copy 16 bytes (128 bits) from data
to v8 in a single operation.
○ Line 21: Here we add the eight values in v8 to v. We use a hybrid approach to precision by using
the intrinsic __hadd to add pairs of values from v8 to get their sum as another half value and
then adding this to the float accumulator v. There are many variations one could try here but
adding too many half values to another half accumulator is likely to rapidly lead to a loss of
precision. The cast (float) could be replaced by the intrinsic function __half2float(), but
hopefully the compiler is performing this sort of optimisation for us. A short sketch of this step is shown after this list.
• Lines 23–30: This last section of code is identical to lines 20–27 of Example 3.7.
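As a concrete illustration of the hybrid-precision accumulation discussed for line 21, the loop body could be written along these lines (a sketch only; the book's listing may differ in detail):

    for (int i = 0; i < 8; i += 2)
        v += (float)__hadd(v8[i], v8[i+1]);  // add pairs in FP16, accumulate the pair sums in FP32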
Our conclusion from the reduction examples is that switching from 4-byte to 2-byte variables
gives a factor of 2 speed-up as we would expect for this memory access dominated problem.
This reinforces a key message of this book: memory access is often the limiting performance
factor on both GPU- and CPU-based computations, so using float rather than double where
possible is really important. We can now add the recommendation to use half rather than float
when possible – but obviously the limited precision makes this impossible in many cases. From
the timings above it is not clear if using tensor cores to perform addition is faster than simply
using thread-based additions. This is because the GPU is good at hiding computational cost in
memory bound calculations. We have, however, shown the method works and is not slow.
Another interesting feature added to the third generation tensor cores of Ampere (CC 8)
GPUs is a sparsity feature involving the automatic dropping of small values from the
A fragments to gain another factor of two in performance. Specifically, this feature is
implemented in hardware by dropping the smallest two elements out of each set of four
elements along the rows of the a matrix. This feature is intended for training neural networks
where the a matrix holds weights applied to the node inputs. In large problems there are
initially many such weights, most of which become unimportant during training. If the
sparsity feature is used appropriately during training, the process will be faster with no loss
of correctness. I have to say this is interestingly close to natural neural evolution where
synaptic connections get stronger with use or atrophy when not used. In the spirit of
Example 11.3, we speculate that this sparsity feature might also be used for sorting or at
least finding the maximum values in large datasets.
11.5 Conclusion
This is the end of the last main chapter in our book and we seem to have ended where we
began – with yet another variation of the reduction problem. This is not because we are
obsessed with adding up lots of numbers really fast, but because reduction is an exemplar for
any memory bound GPU problem. Such memory bound problems are often hard to port
effectively to GPUs. In other chapters we have explored a variety of interesting real-world
problems and presented fully working GPU code which often gives impressive speed-ups for
what I consider to be relatively little effort.
Importantly, all our code is available online for readers to use as starting points for their
own problems. In developing these examples, we learnt a lot and had some surprises. We
were surprised by the erratic performance of built-in C++ random number generators using
both Visual Studio and g++. We were astonished by how badly denormalised floating point
numbers slow down Intel CPUs. On the other hand, in Chapter 8 we were gratified by the
unexpected performance gains achieved in our PET reconstruction code by using 50,000 ×
256 threads instead of the more normal 256 × 256 or 1024 × 256.
We plan to extend some of the examples presented here and to update the code repository
from time to time and to add material for new versions of CUDA – so please check for this.
Finally, this book is not really finished yet; there is more interesting material in
the appendices.
Endnotes Chapter 11
1 Remember that a floating-point fraction starts with an implicit 1, so the 10-bit fraction can represent
11-bit values.
2 On the host, these arithmetic operations are not supported by the hardware and have to be emulated by
software; they are hence quite slow. On the GPU they are supported by hardware. In both cases a large
set of intrinsic functions is available and should be used whenever possible.
3 Buffering the b tiles, though apparently unnecessary, does give a further small speed-up. However, more
importantly, we found that loading a tiles from shared memory but b tiles directly from GPU main
memory did not work on our platform (Windows 10 home edition, CUDA SDK 11.0 and RTX 2070
GPU). The program ran but gave incorrect results which varied from run to run, indicative of a race
condition which we were unable to resolve.
Appendix A
A Brief History of CUDA
CUDA was launched by NVIDIA in 2007 as a high-level programming tool for enabling
scientific computation on their GPUs developed for gaming on PCs. Prior to this there had
been at least a decade of efforts by individuals to use the then-available and rather primitive
GPUs for this purpose. The acronym GPGPU (general-purpose computation on graphics
processing units) was coined to describe these efforts, and an influential website
(www.gpgpu.org) was set up to document and share ideas.1 Although there were some successes,
progress was limited by lack of floating-point support and small graphics memories. The
actual programming of these early GPUs was also difficult; the general idea was to trick the
hardware shader and texture units into calculating scientifically useful results. The resulting
code tended to be specific to one GPU and hence difficult to port between different GPUs;
this was frustrating, as types of GPUs were evolving very rapidly in that era.
NVIDIA CUDA actually replaced the earlier NVIDIA Cg computing language, which,
while not general purpose, was being used for GPGPU applications and offered some degree
of portability. Cg had a C-like syntax but was not actually C and was really intended as a tool
for game developers. Cg continued to be supported by NVIDIA until 2012, and at present
(2022) it is still available as a deprecated legacy product on their website.2
CUDA changed everything; it enabled GPGPU programming in C by adding a few (and,
in my opinion, very elegant) extensions. Moreover, it was a strong statement by NVIDIA
that they wanted to support GPGPU applications on their GPUs. Since that launch, GPU
computing has changed from a niche to a mainstream activity, and NVIDIA has become a
dominant force in high-end supercomputing. This is the reason why we have chosen to write
a GPU programming book about CUDA rather than possible alternatives such as OpenCL or
OpenMP. We think CUDA code is simpler and more elegant and thus facilitates the creation
and maintenance of better code. Also, because CUDA is vendor-specific, it gives better
access to NVIDIA hardware features such as texture lookup with interpolation and recent
innovations such as the tensor cores.
Each GPU generation is assigned a compute capability (CC) level, with later generations having
higher values. The recent Turing generation of GPUs has CC 7.5; CC is backwards
compatible, so, for example, a Kepler GPU with CC 3.5 has all the features of GPUs up
to that level but not some of the features of later GPUs with higher values of CC. The details
of which features go with which CC levels are specified in an appendix of the “NVIDIA
CUDA C Programming Guide”, which is included in the documentation set for each release
of CUDA.3
The software supporting CUDA is provided by the NVIDIA SDK, and this also evolves over
time and has a similar but different numbering scheme. Thus, while the initial SDK release
was CUDA 1.0, the numbering scheme then evolved differently from the GPU CC level. For
example, as of December 2020, the most recent Ampere GPUs have CC level 8.6 but the
most recent CUDA SDK is version 11.5 (or CUDA 11.5 for short). The associated hardware
drivers also need to be updated from time to time to match changes in CUDA hardware
and software.
Although the arrival of CUDA was a huge step for GPGPU programming, some restrictions
in the early hardware and software and their associated workarounds have lingered on
and even now are overemphasised in many tutorials. Firstly, accessing early GPU memory
was slow, and strict coalesced memory accesses (i.e. neighbouring threads access neighbouring
4-byte words) were essential for decent performance. This meant that early CUDA
examples, including those supplied in the initial CUDA SDK, emphasised the use of fully
coalesced memory access or use of shared memory when this was not possible. Additionally,
the use of the dedicated texture and constant memory spaces was recommended to reduce
latency where appropriate. The end result produced some really complicated examples,
which juggled these various tricks to maximise performance. While this was important in
the early days, it has become increasingly less important as successive generations of GPUs
brought in better and bigger caching of GPU memory. Unfortunately, much of the current
learning material, including both CUDA SDK examples and some textbooks, has not kept
pace with these changes and can make CUDA development appear to be more complicated
than necessary. In developing the examples for this book, we have found that in many cases
straightforward code that does not use shared memory or other complicated tricks performs
just as well as, or better than, tricky code based on early SDK examples.4
The second legacy problem is that the SDK examples in the first CUDA release were
written in essentially ANSI C. Hence the code was littered with bare pointers managed with
explicit malloc and free statements, and was quite verbose. Sadly, although the NVCC
compiler now supports C++ up to C++17, many examples have not caught up.5 In this
book we have used modern C++ to simplify our code while keeping the style straightforward
and readable and avoiding excessive use of abstraction.
So far NVIDIA have released eight generations of GPUs named after famous physicists or
mathematicians. Within each generation different cards exist for gaming, HPC and workstation
applications. With each generation, the hardware and software capabilities of the GPUs
have increased. Tables A.1 and A.2 show a summary of naming details and the evolution of
hardware performance. Table A.2 is a nice illustration of the dramatic consequences coming
from a decade of Moore’s law evolution – all the capabilities of the GPUs have increased.
The total number of cores on a GPU has increased steadily by a factor of 20 from 512 on the
Fermi GTX580 to 10,496 on the Ampere RTX3090 – the individual cores have also become
more powerful. Interestingly the number of cores per streaming multi-processor (SM), while
remaining a multiple of the warp size 32, has had a different trajectory. The number of warps
per SM peaked at six (192 cores) in the early Kepler generation and then fell back to two (64
cores) by the Pascal generation. This is because many kernels have their performance limited
by the number of SMs rather than the number of cores. For this reason, Kepler often failed to
deliver a significant real-world performance enhancement over Fermi in spite of the greatly
increased number of cores.6
The fall in power consumption per TFlop over 10 years is a dramatic and welcome
reduction. This is especially true for high-end supercomputers that use literally thousands of
GPUs. For example, the second ranked machine in the November 2021 TOP500 list is the
Oak Ridge Summit system. It has 27,648 V100 GPUs and delivers about 150,000 TFlops on
a high-performance Linpack benchmark.
Of course, not just the number of cores but also their computational capabilities have
improved over time. Much of this is due to detailed changes such as improved atomic
operations, faster memory and better caching strategies.
The Volta generation introduced a step change in design of the compute cores. As in
previous generations, each core can perform both FP32 and INT32 arithmetic operations, but
in Volta a core can perform both types of operation simultaneously, whereas in previous
generations a core could only perform one of these operations in a cycle. Another important
change is to move away from the strict SIMT principle where all 32 threads in a warp
execute the same instruction in lockstep. Starting from Volta, individual threads now have
individual program counters (PCs) instead of a single PC used by all threads in the
warp.7 This breaks older kernel code, which assumes there is an implicit synchronisation
between the 32 threads of a warp and hence omits the calls to __syncthreads() that
would otherwise be logically necessary. From Volta onwards an explicit synchronisation is
necessary, as now the threads may not be executing in lockstep.
Table A.2

Generation:                Kepler   Maxwell  Pascal   Volta    Ampere   Ampere   Ampere     Turing      Turing
Chip:                      GK110B   GM200    GP100    GV100    GA100    GA102    GA102      TU102       TU106
GPU:                       K40      M40      P100     V100     A100     A40      RTX 3090   RTX 2080Ti  RTX 2070
Compute capability:        3.5      5.2      6.0      7.0      8.0      8.6      8.6        7.5         7.5
Date:                      10/2013  11/2015  06/2016  06/2017  05/2020  10/2020  09/2020    09/2018     10/2018
Total cores:               2880     3072     3584     5120     6912     8704     10496      4352        2304
SMs:                       15       24       56       80       108      84       82         68          36
Cores per SM:              192      128      64       64       64       128      64         64          64
FP16 peak (TFlops):        –        –        21.2     31.4     78       37.4     35.6       28.5        15.8
FP32 peak (TFlops):        5.2      6.1      10.6     15.7     19.5     37.4     35.6       14.2        7.9
FP64 peak (TFlops):        2.6      3.1      5.3      7.8      9.7      0.584    0.508      0.420       0.247
TC 16/32 peak (TFlops):    –        –        –        125      156      149.7    71         113.8       31.5
Maximum memory size (GB):  12       12       16       32       40       48       24         11          8
Memory bandwidth (GB/s):   288      288      703      877      1215     696      936        616         448
Registers per SM (K):      64       64       64       64       64       256      256        64          256
L1+Shared per SM (KB):     64       24+96    24+64    128      164      128      128        96          96
L2 cache (KB):             1536     3072     4096     6114     40,000   6144     6144       5632        4096
Fabrication (nm):          28       28       16       12       7        8        8          12          8
Transistors (×10⁹):        7.1      8        15.3     21       52.2     28.3     28.3       18.6        10.8
Max power (Watts):         235      250      300      300      400      300      350        260         185
Watts per FP32 TFlop:      45       41       29       20       20.5     8.0      9.8        18.3        12
Relative performance:      0.4      0.5      2.1      3.9      6.7      7.4      9.4        2.5         1.0
However, the lightweight __syncwarp() function, which is local to each warp, can be used for this purpose instead
of the expensive __syncthreads() function, which has thread block wide effect. More
details about this are in Chapter 3 on cooperative groups.
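A minimal illustration of the kind of code affected is sketched below; this is not an example from the book, and ws is an assumed __shared__ float array while partial is each lane's own value:

    ws[threadIdx.x] = partial;          // each lane writes its own partial result
    __syncwarp();                       // make the writes visible to the rest of the warp
    float other = ws[threadIdx.x ^ 1];  // now safe to read a neighbouring lane's value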
From Table A.1 we see that the more recent GPUs have higher compute capabilities. In
most generations the same generic architecture is available as a low-end GeForce gaming
card, a Tesla HPC card without graphics capability but with much enhanced double-precision
performance, or a Quadro workstation card with both graphics and good double-precision
performance. The named cards are examples of high-end cards in their class; other versions
with fewer or sometimes more cores are available. For example, the RTX 2080 has 2944
cores, whereas the RTX 2070 has 2304 cores and the RTX 2080 Ti has 4352 cores. The
codenames in the last column refer to the underlying chipset architecture. Compute capabil-
ities are backwards compatible, but recent drivers no longer support CC levels 1 or 2.
The GPU used for performance measurements in this book is the RTX 2070, which is a
relatively inexpensive gaming card. The last row of Table A.2 shows the relative perform-
ance of other GPUs using the product of FP32 GFlops/sec and memory bandwidth in GB/sec
as a metric. The RTX 3090, launched in January 2021 at an initial price of about £1400,
performs a factor of 10 better on this metric.
An interesting recent advance, starting with Pascal, is the introduction of hardware
support for FP16 calculations at twice the speed of FP32. Although a single FP16 variable
can be used in code, the speed advantage is only gained by using intrinsic functions
operating on pairs of FP16 variables stored in 32-bit words. The reason for this is that the
hardware implements FP16 operations using the existing FP32 registers modified to operate
on such pairs in parallel. This is actually an example of thread-wise SIMD operations
supported by CUDA. The Volta and later architectures massively extend the usefulness of
FP16 calculations by introducing tensor cores (TCs) to support mixed FP32 and FP16
arithmetic for 4 × 4 matrix multiplication, which can deliver a peak of over 100 TFlops of
computation, as shown in the TC 16/32 line of Table A.2. Although the introduction of TCs
and FP16 support is intended for machine learning applications, it is likely that other
application areas will be found.
The important area of inter-GPU communication has also seen recent advances. It is now
common for a single workstation to have more than one GPU installed, and in the past they
would have been managed by the host passing data back and forth to the GPUs across the
PCIe bus. Recent Tesla and Quadro cards have NVLINK, a much faster direct GPU-to-
GPU interconnect.
Version   Date        CC support   Added features
11.0 May 2020 3.0 – 8.0 Ampere support
10.0 Sept 2018 3.0 – 7.5 Turing support. 10.1 (Feb 2019) and 10.2 (Nov 2019)
9.2 May 2018 3.0 – 7.0 Maintenance release.
9.1 Dec 2017 3.0 – 7.0 Passing __restrict__ references to __global__
functions supported
9.0 Sept 2017 3.0 – 7.0 Volta support. Tensor cores, simultaneous FP32 & INT32 per
core. Warp shuffle supports 8-byte fields. Cooperative groups
introduced. __syncwarp() or __syncthreads() now
mandatory for warp level programming. New synchronising
versions of warp vote functions introduced, __activemask()
added. Ended support for Fermi and below.
8.0b Feb 2017 2.0 – 6.2 Pascal support. AtomicAdd for FP64 in global and shared
memory.
7.5 Sept 2015 2.0 – 5.3 C++11 support. 8.0a (Sep 2016). Maintenance release
7.0 Mar 2015 2.0 – 5.2 Ended support for Tesla (CC < 2.0). CUSOLVER library
introduced.
6.5 Aug 2014 1.1 – 5.0 Maintenance release.
6.0 Apr 2014 1.0 – 5.0 Maxwell support.
5.5 Jul 2013 1.0 – 3.5 Maintenance release.
5.0 Oct 2012 1.0 – 3.5 Dynamic Parallelism. FP16 operations on device. Funnel shift.
4.2 Apr 2012 1.0 – 3.0 Kepler support. Unified memory programming. Warp shuffle
functions __shfl() etc. introduced for CC ≥ 3.0.
4.1 Jan 2012 1.0 – 2.1 CUBLAS library introduced.
4.0 May 2011 1.0 – 2.1 cuRAND, cuFFT, cuSPARSE, NPP and THRUST libraries
introduced.
3.2 Nov 2010 1.0 – 2.1 Maintenance release.
3.1 Jun 2010 1.0 – 2.0 Maintenance release.
3.0 Mar 2010 1.0 – 2.0 Fermi support. Atomic functions for FP32 in global and shared
memory and INT64 in shared memory. Limited FP16 support
in textures. 3D thread block grids. Surfaces introduced.
2.3 Jun 2009 1.0 – 1.3 Some C++ support, including function templates and operator
overloading.
2.2 May 2009 1.0 – 1.3 Pinned Memory support.
2.1 Jan 2009 1.0 – 1.3 cudaMalloc3D() and cudaMalloc3DArray() added.
2.0 Aug 2008 1.0 – 1.3 Atomic functions for INT64 in global memory and for INT32 in
shared memory. Support for FP64 in device code. Warp vote
functions __all(), __any() and __ballot()
introduced.
1.1 Dec 2007 1.0 – 1.1 Atomic functions for INT32 in global memory.
1.0 June 2007 1.0 Tesla support. Initial release.
The CC support column shows the range of GPU CC levels supported by each release. Support for the early Tesla and Fermi generations ended with
Toolkit versions 7.0 (2015) and 9.0 (2017), respectively.
Of particular note is Toolkit 9.0, which introduced support for Volta (CC 7.0) requiring
significant changes in the management of warp-level programming that may break
older codes. Toolkit 7.5 is also interesting in that it introduced good support for C++11
features – sadly, many CUDA tutorial examples in NVIDIA’s own example set and elsewhere
do not yet take advantage of C++11 to simplify code. One feature of our examples is
the use of such features where they simplify our code.
On a Windows machine, the Toolkit is typically installed under C:\Program Files\.
On my system, this is C:\Program Files\NVIDIA GPU Computing Toolkit
\CUDA\v10.1, as indicated in Figure A.1.
The top-level install directory contains release notes and licence information in .txt files
and directories for all the essential components; some of these directories will be added to
the system search path at installation time. The more important directories are:
• bin: Executable files, including the CUDA NVCC compiler nvcc.exe and the profiler nvprof.exe.
A large number of .dll files are also held in this directory.
• doc: Comprehensive documentation in both HTML and PDF formats. The “CUDA C++
Programming Guide” is a good place to start, but there are detailed guides for those who
want to go deeper. There are also guides for the various libraries, such as cuRAND and
THRUST. The “CUDA_Samples” guide tells you all about the samples included in
the SDK.
• include: A large number of .h files are necessary to compile CUDA code. Most user code
in a .cu file only needs to explicitly include cuda_runtime.h. This file will load other .h
files as necessary. If your code mixes .cu and .cpp files, then the .cpp files (which are not
compiled by NVCC) may need to include vector_types.h to access the definitions of
CUDA types such as float3.
• lib: A large number of .lib files support various CUDA options. For example, you will
need curand.lib if you make use of cuRAND in your code.
• src: A small number of files for Fortran support.
• tools: A single file, “CUDA_Occupancy_Calculator.xls”, which, as the name
implies, can be used for occupancy calculations.
On Windows, the toolkit installation process will add a plugin to Visual Studio C++,
enabling you to create CUDA projects directly without needing to use the process described
in the “CUDA_Samples” guide. For this to work, Visual Studio must be installed prior to the
installation of the CUDA SDK.
Appendix A Endnotes
1 Sadly, this historic website has now gone but “gpgpu” is still a very useful keyword for web searches.
2 Somewhat confusingly for people with long memories, recent versions of CUDA contain a cg name-
space which is used for cooperative groups. The eagle eyed will notice “c” is not capitalised in this case.
3 Since the 2019 CUDA 10.1 release, the documentation covers compute capabilities between 3.0 and
7.5; support for CC levels 1 and 2 has now ended.
4 This is, of course, an oversimplification; there are still some cases where shared memory is very helpful.
However, constant memory is now automatically used by the compiler for kernel arguments which are
declared const, so explicit use of constant memory is less necessary.
5 For instance, the CUDA SDK addon for Visual Studio helpfully provides a code sample, kernel.cu,
when creating a new CUDA project. Unfortunately, even in 2022 this sample is written in essentially
early C and even uses the dreaded goto statement for error handling.
6 To be fair to NVIDIA, the extra cores on Kepler cards did function well for gaming purposes and at that
stage gaming was NVIDIA’s main focus. A decade later Turing cards are superb for both gaming and
HPC and have new hardware features for both, Tensor cores for AI and ray tracing (RT) units
for graphics.
7 The Volta and Turing hardware typically requires two register slots per thread for the thread’s PC. This
might impact code with demanding per-thread register requirements.
8 The web link is https://developer.nvidia.com/cuda-toolkit. Versions are available for Windows and
Linux. Support for MacOS ended with version 10.2.
9 Explorer (renamed File Explorer in Windows 10) is a venerable and much-loved tool replicated in
MacOS and Linux GUIs. However, Microsoft has some curious default choices, one of which is making
ProgramData a hidden directory. An even worse default is “hiding extensions for known file types”
which makes it much more likely that the naïve user will click on an evil .exe file planted by a virus. On
Windows 10 these options can and should be changed using “change folder and search options” under
File Explorer->view->options.
Appendix B
Atomic Operations
The essence of parallel programming is that many threads run simultaneously and coopera-
tively to solve a problem – so what happens if two or more threads try to write to the same
memory location simultaneously? This is a generic problem for any parallel system using
shared memory, that is, where multiple active threads can access the same block of memory
at the same time. Reading from shared memory alone is not a problem and indeed NVIDIA
GPUs feature a dedicated block of constant memory for just this purpose. But writing to
shared memory raises two difficult issues:
1. What happens if two or more threads write to the same memory location simultaneously?
On NVIDIA hardware one thread will succeed and the others will silently fail. This
produces incorrect results in most cases.
2. The so-called read-after-write problem. If some threads are writing to memory while
others are reading from it, how can we be sure the necessary write operations have
completed before other threads perform reads? This problem is greatly complicated by the
hierarchy of caching levels used in modern computing systems – how can we ensure main
memory and all caches are kept in synchronisation? This is the cache-coherency
problem.
Problem 1 is solved by implementing atomic operations in hardware or software. If two or
more threads use an atomic function to write to the same memory location simultaneously,
the requests are queued in hardware and executed one at a time. They all succeed but there is
no guarantee about the order in which they are executed. Such queued operations are known
as atomic operations and CUDA provides a range of atomic functions that run on the GPU.
This appendix describes these functions.
We should point out that problem 2 is much harder to solve and basically it is up to the
programmer to synchronise threads so as to avoid read-after-write errors – the various
CUDA sync functions are the key to doing this.
B.1 AtomicAdd
The CUDA atomicAdd function is typical and the version for 32-bit integers is shown in
the box.
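For reference, the 32-bit integer overload has the form int atomicAdd(int *address, int val); it atomically adds val to *address and returns the old value of *address. The kernel below is a purely illustrative usage sketch, not an example from the book:

    __global__ void histo(int *bins, const int *keys, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(&bins[keys[i]], 1);  // concurrent increments are queued, none are lost
    }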
The atomic functions in this table are atomic for all threads on a single GPU in the kernel launch. For
recent GPUs with CC ≥ 6.0, the following additional forms are valid:
atomicFun_block: The operation is atomic for all threads in one thread block but not between
thread blocks.
atomicFun_system: The operation is atomic across multiple GPUs and also host memory in
managed memory scenarios.
This function has been available on all GPUs with CC ≥ 1.1, but on the early GPUs
it required that the variable being updated be in device global memory. On all modern GPUs
(CC ≥ 3.5) shared memory can also be used. Over GPU generations additional variable types
have been added to those supported by the atomic functions. AtomicAdd now supports more
types than any of the other atomic functions and the full set (up to CUDA 11.2) is shown in
Table B.1. Note that atomic functions are not template functions; they use standard C/C++
compiler function selection based on argument types. Some types require more recent GPUs
for support; 64-bit integers require CC ≥ 3.5, double, half2 and nv_bfloat162 require
CC ≥ 6.0, half requires CC ≥ 7.0 and nv_bfloat16 requires CC ≥ 8.0.
On early GPUs, atomic functions were slow and were avoided by programmers wherever
possible. On more recent GPUs, they are significantly faster especially when only a few threads
actually compete to perform the operation. A good example is our fast reduce kernel where only
one thread in a warp competes to perform an addition to an element of the output array.
B.2 AtomicCAS
The atomicCAS function is interesting because it can be used to make new atomic
functions. Example B.1 shows an implementation of integer atomicAdd using
atomicCAS:
If two or more threads call atomicCAS simultaneously, only at most one thread will be allowed to succeed by the atomicCAS function.
Importantly, the value returned by atomicCAS is used to update acc_now and this is the value
that was used by atomicCAS in the comparison with acc_test for this thread.
◦ Line 16: If after calling atomicCAS the updated value of acc_now is equal to acc_test the
CAS test must have succeeded, so this thread now exits the while loop.
• Line 18: On exiting the function we return the value in acc[0] used at the start of this thread’s
successful call to atomicCAS.
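Since the listing is described above only line by line, the following compact sketch shows the general shape of such a CAS loop; it uses the same variable names as the description but is not guaranteed to match the book's Example B.1 exactly:

    __device__ int my_atomic_add(int *acc, int val)   // illustrative, not the built-in atomicAdd
    {
        int acc_now = acc[0];                         // current guess at the accumulator value
        while (true) {
            int acc_test = acc_now;                   // value we expect to find in acc[0]
            acc_now = atomicCAS(acc, acc_test, acc_test + val);
            if (acc_now == acc_test) break;           // CAS succeeded: our add was applied
        }                                             // otherwise retry with the updated value
        return acc_now;                               // value of acc[0] used by the successful CAS
    }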
I would rate this function as being quite tricky code; we need to know that atomicCAS
and all the other built-in atomic functions only ever let one thread succeed at any one time.
We have to be very aware that this code is being executed in parallel. The code is also very
inefficient if called simultaneously by many threads as their accesses will be effectively
serialised. In our fast reduce examples we used warp level reduction and then only one
thread from each warp called atomicAdd. This turns out to be as good as, if not better than,
any other approach. Obviously, you should avoid letting all 32 threads in a warp call atomic
functions simultaneously. The warp shuffle functions are helpful here.
Another point to make is that the above version of atomicAdd is likely to always be less
efficient than the standard version because modern GPU hardware directly supports common
atomic operations.
Interestingly, atomicCAS only takes integer types for its arguments – so how can we
implement something like atomicAdd for floats or doubles? The answer is to notice that
atomicCAS either does nothing, if the CAS test fails, or, if it succeeds, simply stores the
third argument at the address pointed to by the first argument. There is no arithmetic performed, so we
can make atomicCAS work for floats if we do some gymnastics with type casting. The
result is shown in Example B.2.
26 if(__float_as_uint(acc_test) ==
__float_as_uint(acc_now)) break; // OK?
27 }
28 return acc_now;
29 }
This version of atomicAdd is likely to be less efficient than the standard one. Note the
use of the CUDA built-in cast __float_as_uint(). This is an instruction to the NVCC
compiler to treat the bit pattern as a uint; the bit pattern is not changed. Since we only
copy or compare two floats, both of which use this cast, correct results are obtained. It is,
however, necessary that uint and float both have the same size (4 bytes).
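Putting the casts together, a self-contained sketch of the float version might look like this; it is an illustrative reconstruction, not necessarily identical to the book's Example B.2:

    __device__ float my_atomic_addf(float *acc, float val)  // illustrative float add built on atomicCAS
    {
        float acc_now = acc[0];
        while (true) {
            float acc_test = acc_now;                        // expected value of acc[0]
            acc_now = __uint_as_float( atomicCAS((unsigned int *)acc,
                              __float_as_uint(acc_test),
                              __float_as_uint(acc_test + val)) );
            if (__float_as_uint(acc_test) == __float_as_uint(acc_now)) break;  // CAS succeeded
        }
        return acc_now;
    }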
The atomicCAS function can also be used to provide a MUTEX (mutual exclusive
access) flag that a group of threads can use to serially execute any piece of code but care is
required to avoid deadlock between divergent threads. Serialising single thread execution in
CUDA is very slow and should obviously be avoided.
Appendix C
The NVCC Compiler
CUDA programs are compiled using the NVIDIA NVCC compiler; this is either done
implicitly in Visual Studio using the NVIDIA supplied build tools or explicitly on the
command line in Windows or Linux using the NVCC command.
10 "%CUDA_PATH%\bin\nvcc.exe"
11 -gencode=arch=compute_75,code=\"sm_75,compute_75\"
12 --use-local-env
13 -ccbin "%VS2017INSTALLDIR%\VC\Tools\MSVC\14.16.27023\bin
\HostX86\x64"
14 -x cu
15 -I"D:\Users\rea1\OneDrive\Code\inc"
16 -I"%NVCUDASAMPLES_ROOT%\common\inc"
17 -I"%CUDA_PATH%\include"
18 --keep-dir x64\Release
19 -maxrregcount=0
20 --machine 64
21 --compile
22 -cudart static
23 --use_fast_math
• Line 10: Runs nvcc using an explicit path to the exe file.
• Line 11: GPU code for a CC=7.5 (Turing) device is to be generated (this choice is not a default).
• Line 12: Tells NVCC that the local environment variables have been set to run the local C++
compiler (VC in this case).
• Line 13: Path to the local C++ compiler.
• Line 14: The input is a .cu file.
• Lines 15–17: These are extra directories to search for include files; they are specified by prefixing -I.
• Line 18: Relative path to output directory for .exe file.
• Line 19: Maximum number of registers per thread. The value zero is a default leaving the compiler
free to choose. Specifying a value greater than 32 might reduce the maximum occupancy of the
kernel. 32 is usually the best choice, but this is one flag you may want to tune for optimal
performance. For more information see the NVCC reference manual.
• Line 20: Compile for 64-bit OS.
• Line 21: Compile all files in argument list after options (this is a default).
• Line 22: Specify CUDA run-time library; possible choices are none, shared and static. The
default is static.
• Line 23: This is our favourite option – it makes your kernel code faster but it is not a default.
The --use_fast_math flag also implies the additional options --ftz=true, --prec-
div=false, --prec-sqrt=false and --fmad=true.
• Line 24: This is a standard list of definitions passed to the VC compiler.
• Line 25: The keyword Xcompiler introduces a list of options for the host C++ compiler; in this
case VC.
• Line 26: The option -o names the output files. If omitted the files a.obj and a.exe are created.
This flag should always be used.
• Line 27: This is the file to be compiled, which comes after the options. It is not prefixed with -- or -
and a list of several files can go here.
Note that in simple cases the complexity of Example C.1 is not needed; on my
Windows 10 system I can perform the same compilation with the command shown in
the box:
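A command of the kind intended might look like this (the source and output file names here are illustrative; the include path is the one from Example C.1):

    nvcc -arch=sm_75 --use_fast_math -I"D:\Users\rea1\OneDrive\Code\inc" ^
         -o gpulog.exe gpulog.cu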
where the “^” indicates that a continuation line follows. The include from line 17 is
a default. On Linux the environment variables NVCC_PREPEND_FLAGS and
NVCC_APPEND_FLAGS can be defined to hold commonly used strings. Note that Visual
Studio generates a separate -I for each path but a single “I” followed by a comma separated
list of paths is also valid. A space between the “I” and filename is also allowed:
A commonly used set of options can be read from a text file using -optf <file>:
Note that Windows parameters such as NVCUDASAMPLES_ROOT are not expanded when
read from text files and thus would have to be replaced by their actual values.
The process of building an executable program file from a .cu file is quite complex.
Device code is first compiled by NVCC, producing PTX assembly language for the GPU
code. The PTX file is then converted into machine code for the target device and merged with
code from the host compiler to finally produce an executable file. Interested readers can find
more detail in the NVIDIA NVCC Reference guide.
◦ --verbose or -v: List the compilation commands generated by this compiler driver, but do
not suppress their execution.
◦ --keep or -keep: Keep all intermediate files that are generated during internal
compilation steps.
◦ --keep-dir <dir> or -keep-dir <dir>: Keep all intermediate files that are generated
during internal compilation steps in this directory.
◦ --save-temps or -save-temps: This option is an alias of --keep.
◦ --clean-targets or -clean: This option reverses the behaviour of NVCC. When
specified, none of the compilation phases will be executed. Instead, all of the non-temporary
files that NVCC would otherwise create will be deleted.
◦ --run-args <arg>,. . . or -run-args <arg>,. . .: Used in combination with option
--run to specify command line arguments for the executable.
• Options for steering GPU code generation.
◦ --gpu-architecture <arch> or -arch <arch>: Specify the CC level of the target
GPU, for example, -arch=sm_75 for Turing CC=7.5 devices. The resulting exe file will run
on all devices of the specified CC level and higher. The current (December 2021) minimum
supported CC level is CC=3.0 but levels below 5.2 are deprecated. This is an important
parameter to get right; we recommend always explicitly using the CC level of your
target device.
◦ --gpu-code <code> or -code <code>: This controls the CC level of the actual GPU code
generated. If omitted the value set by -arch is used which is usually what you need. See the
NVIDIA documentation for more information.
◦ --maxrregcount <value> or -maxrregcount <value>: This is the maximum
number of registers that GPU functions can use. This parameter is actually obsolete and should
not be used. The compiler and run time system will normally optimise this for you.
Experienced users can also use the Launch Bounds feature, described in Appendix B of the
CUDA Programming Guide, in kernel code to give hints to the compiler; a brief sketch follows this group of options.
◦ --use_fast_math or -use_fast_math: This is the one option we use nearly all the
time. If in doubt about accuracy, try with and without using this switch and compare numerical
results. In some cases the performance effect is dramatic; in others it makes little difference.
Experienced users can omit this switch and explicitly mix fast intrinsic functions with slower
standard versions in kernel code to optimise both speed and accuracy. This option also implies
--ftz=true, --prec-div=false, --prec-sqrt=false and --fmad=true.
◦ --extra-device-vectorization or -extra-device-vectorization: This
option enables more aggressive device code vectorisation in the NVVM IR1 optimiser. We
have not experimented with this option.
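As mentioned under -maxrregcount above, kernel-side launch bounds are a gentler way to influence register allocation. A minimal sketch, where the numeric values are arbitrary examples:

    // request at most 256 threads per block and at least 4 resident blocks per SM
    __global__ void __launch_bounds__(256, 4) mykernel(float *out)
    {
        out[blockIdx.x*blockDim.x + threadIdx.x] = 0.0f;  // placeholder body
    }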
• Options for steering cuda compilation.
◦ --default-stream <value> or -default-stream <value>: <value> is one of legacy,
null or per-thread and specifies the default stream on which GPU work will be queued. We
recommend per-thread as the default stream; note that here "thread" refers to CPU threads, not GPU threads. This is
the only flag in this section.
• Generic tool options.
◦ --disable-warnings or -w: Suppresses all warning messages.
◦ --source-in-ptx or -src-in-ptx: Interleaves source in PTX code. May only be used
in conjunction with --device-debug or --generate-line-info.
◦ --restrict or -restrict: Treat all kernel pointer arguments as restrict pointers. We do
this explicitly in all our examples.
Appendix C Endnotes
1 The NVIDIA NVVM IR compiler is a version of the LLVM IR project compiler. More information
can be found in the NVIDIA NVVM IR Specification Reference Manual which is part of the
SDK documentation set. More details of the LLVM compiler infrastructure project can be found at
https://llvm.org/.
Appendix D
AVX and the Intel Compiler
In this appendix we discuss how best to access the SIMD capabilities of an Intel CPU. As mentioned in Chapter 1, Intel added hardware to perform SIMD operations on vectors of floats and integers beginning with the MMX instructions on the Pentium II in 1997.1
On a computer an operation such as x = y * z is performed by the hardware, first
loading y and z into hardware registers, then performing multiplication using the ALU with
the result appearing in another register and finally storing the result in x. The registers used
typically only have enough bits to hold a single variable, for example 32-bits for a 32-bit
variable. The idea of SIMD hardware is to support vector operations such as X = Y * Z
where multiplication is performed on corresponding components of the vectors simultan-
eously. For this to work with vectors of say, four 32-bit floats the registers need to be 128 bits
wide to hold all the components of the vectors and the ALU needs to be upgraded to perform
four 32-bit multiplications simultaneously. For cost reasons the SIMD-ALU only needs to support the most important operations. The original MMX hardware used registers that were 64 bits wide and only supported integer arithmetic. Thus, the supported vector types were two-component 32-bit integers (I32), four-component 2-byte integers (I16) or eight-component 1-byte integers (I8). Later hardware versions use wider registers which support more components. The most recent architecture, AVX-512, uses 512-bit wide registers and supports simultaneous operations on vectors of sixteen 4-byte components. Table D.1 gives a brief overview of the history and more details can be found online, for example at https://en.wikipedia.org/wiki/Advanced_Vector_Extensions.
In the table most generations are backwards compatible with earlier versions. Successive
generations include numerous other improvements such as wider instruction sets that are not
detailed here.
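To make the vector idea concrete, the elementwise multiply X = Y * Z discussed above can be written with the AVX2 intrinsics that appear later in this appendix. The sketch below is not taken from the book's examples; it assumes n is a multiple of 8 and uses unaligned loads and stores so that no special alignment is required:

#include <immintrin.h>   // Intel AVX/AVX2 intrinsics

// elementwise multiply z = x*y, eight floats at a time; n assumed to be a multiple of 8
void vmul_avx2(float *z, const float *x, const float *y, int n)
{
    for(int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x+i);              // load 8 floats from x
        __m256 vy = _mm256_loadu_ps(y+i);              // load 8 floats from y
        _mm256_storeu_ps(z+i, _mm256_mul_ps(vx, vy));  // 8 products in one instruction
    }
}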
D.1 MMX
The first MMX (standing for MultiMedia Extensions) instructions were added to the Pentium
II in 1997 to support graphics and audio processing and only worked with integers. It is
interesting to note that like GPUs the original intent was to support entertainment on PCs not
“serious” computing. The new MMX registers were 64 bits and thus supported simultaneous
operations on two 4-byte integers or four 2-byte integers or eight 1-byte integers. Floating
point capability was added soon after with the upgrading to SSE (streaming SIMD extensions)
on the Pentium III in 1999. For SSE the register width was increased to 128 bits but the only
supported operations were for 32-bit floats. However, MMX could still be used for integers.
Figure D.1 Normal scalar and AVX2 eight-component vector multiplication
Successive generations not only increased the width of the operation registers but also added to
the available instruction set. The more recent instruction sets are very rich; the details are not
discussed here, but, if you are interested, more information can be found on Intel’s website:
https://software.intel.com/content/www/us/en/develop/home.html; try searching for avx2.
Figure D.1 illustrates multiplication using the 256-bit wide registers of AVX2 which allow
eight simultaneous operations on 4-byte variables.
Initially, the new hardware was difficult for programmers to use directly; essentially bits of
hand-crafted assembly code had to be inserted into programs. Gradually libraries for linear
algebra and the like began to appear which gave programmers an easier route to exploiting
SIMD. Library developers, however, struggled to keep up with the frequent updates to
hardware while retaining support for older versions so even this route was somewhat limited.
More recently compilers have started to automatically vectorise code during compilation, but
this only works for pieces of code where it is easy for the compiler to figure out that
vectorisation is possible. Nevertheless, it is likely that much of the software running on your
current PC is exploiting SIMD to some extent.
The results show a performance of about 3.5 GFlops/sec using the Intel ICC compiler and
1.6 GFlops/sec using Visual Studio. The ICC result is somewhat dependent on the value of size, with smaller values of size giving better performance, presumably due to improved memory caching. The VS compiled code is much less sensitive to the value of size.
// 1000th root of 10
10 __m256 ma = _mm256_set1_ps(1.002305238f);
15 cx::timer tim;
16 for(int k=0;k<size/8;k++){
17 for(int i=0;i<reps;i++){ // x = a*x+y
18 mx[k] = _mm256_fmadd_ps(ma,mx[k],my[k]);
19 }
20 }
21 double t1 = tim.lap_ms();
// get 8 elements
22 float check[8]; _mm256_storeu_ps(check,mx[7]);
23 double gflops = 2.0*(double)(size)*(double)reps/
(t1*1000000);
24 printf("avxsaxpy: size %d, time %.6f ms check %10.5f
GFlops %.3f\n",size,t1,check[7],gflops);
25 free(mx); free(my); //tidy up
26 return 0;
27 }
• Line 4: This added include of immintrin.h is necessary to use the Intel intrinsic functions.
• Line 10: This creates the __m256 object ma containing eight copies of the scaling constant a. This
is equivalent to line 9 of (a). The function _mm256_set1_ps returns an __m256 object with eight copies of its input argument. If we want to fill a vector with different numbers then the function _mm256_set_ps can be used; this function takes eight separate arguments specifying the required values.
• Lines 11–12: These are the equivalent of lines 10–11 of (a) and create arrays mx and my of __m256
objects of dimension size/8 which is enough to hold size 4-byte floats. Here again we use plain
malloc rather than a container class. We found using std::vector here leads to a drop in
performance, possibly due to memory alignment issues.
• Lines 13–14: Here we fill the mx and my vectors with values 1 and 0 respectively, again using the _mm256_set1_ps function.
• Lines 15–21: This is the timed saxpy loop equivalent to lines 13–15 of Example D.1. Note the outer
loop of the array index k only needs size/8 iterations as we process eight floats per step. The
actual saxpy calculation is performed in line 18 using the _mm256_fmadd_ps function, which returns the product of its first two arguments plus the third. A single 8-fold fused multiply and add
instruction (FMA) will be used if supported by the hardware.
• Line 22: Here we extract a set of eight result values from the eighth element of the mx vector, mx[7],
using the function _mm256_storeu_ps. The values are stored in the float array check. The “u”
in this function name stands for unaligned, which means the function is designed to work even if the array check is not aligned on a 32-byte (256-bit) memory boundary. If we were sure that the first element of check was properly aligned, then the "u" in the function name could be omitted. The intrinsics
library contains many functions with and without “u” functionality.
A “feature” of the library is that when numbers are stored to __m256 objects their order is
reversed in memory. But when the numbers are subsequently extracted an order reversing operation
is not done. Thus check[7] from mx[7] actually corresponds to the 129th element of the vector x
used in Example D.1. In this particular example all the values in x and y are the same. However, in
real code where they would be different this feature is a rich breeding ground for bugs.
• Lines 23–27: These lines are the same as 16–20 in Example D.1.
There are many more functions in the Intel intrinsics library which we have not described
here. A good place to look is https://software.intel.com/sites/landingpage/IntrinsicsGuide/,
which lists all the functions and provides numerous check boxes to select the ones of interest.
Individual function names can then be clicked to get details of their arguments.
We can get more CPU performance by parallelising the loop over size/8 in D.2 using
OpenMP. This is well supported by ICC and requires adding just one extra line as shown in
Example D.3. We also have to specify support for OpenMP in the project properties page (C/C++ => Language [Intel C++] => OpenMP Support).
. . .
15 cx::timer tim;
15.5 #pragma omp parallel for // OpenMP
16 for(int k=0;k<size/8;k++){
. . .
The CPU performance of D.3 at over 70 GFlops/sec is quite impressive for my 4-core
Haswell i7 4790 CPU.
As a final step we compare with a GPU implementation shown in Example D.4.
27 cx::timer tim;
28 gpusaxpy<<<blocks,threads>>>(dev_x.data().get(),
dev_y.data().get(), a, size, reps);
29 cx::ok( cudaDeviceSynchronize() );
30 double t1 = tim.lap_ms();
• Lines 10–17: These define the kernel function gpusaxpy which takes five input arguments: the data vectors x and y, the scaling constant a and the counts size and reps. One thread performs the entire loop over reps and different threads process different elements of the vectors. We use our usual thread-linear addressing to process the whole vector; a sketch of such a kernel is shown after this list.
• Line 14: This is the for loop over reps which performs the saxpy calculation.
As an important detail we point out that our code appears to be using x[k] and y[k] as
variables in the loop even though they are references to GPU main memory and not local variables
stored in registers. In general, this must be avoided in CUDA kernels as it degrades performance.
However, in this case tests have shown that the compiler is smart enough to have done the work of
copying these array elements to local registers for us. Explicitly using temporary variables for the
loop turns out to make no difference in this case, but in other cases it might, so you need to be alert to
this issue.
• Lines 18–39: This is the main program which is pretty standard and just prepares the data and
launches the kernel.
• Lines 20–23: The user-supplied parameters are read as before, but with threads and blocks added.
• Lines 25–26: These create and initialise device thrust vectors dev_x and dev_y for x and y. Note
that for this simple case where all the elements of a vector are equal we can initialise using the class
constructor. There is no need for matching host vectors in this case.
• Lines 27–30: This is the timed loop that runs the kernel.
• Lines 31–33: Print results.
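A sketch of what such a kernel might look like is shown below; the actual kernel in the repository version of Example D.4 may differ in detail:

// sketch of a gpusaxpy kernel along the lines described above
__global__ void gpusaxpy(float *x, const float *y, float a, int size, int reps)
{
    int stride = gridDim.x*blockDim.x;                       // thread-linear addressing
    for(int k = blockIdx.x*blockDim.x + threadIdx.x; k < size; k += stride) {
        for(int i = 0; i < reps; i++) x[k] = a*x[k] + y[k];  // the saxpy iteration
    }
}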
The results show that for size = 2^24 the GPU delivers over 6.3 TFlops/sec, which is about 85 times faster than the best CPU-only version. For the larger problem with size = 2^28 the GPU delivers 8.2 TFlops/sec, a speed-up of about 110.
If we compare the code in our host and GPU versions, to my eyes the CUDA version
actually seems cleaner than the AVX versions. But obviously where host code must be used,
Example D.3 is a big improvement on the single core version without AVX support. Note,
however, that the Intel intrinsics library is limited to linear algebra type problems whereas
CUDA kernels can be written to cover many additional problem types. It is the generality of
CUDA that accounts for its success. The potential for huge gains in performance is, of
course, another incentive.
One final lesson from this appendix is that it is well worthwhile trying the ICC compiler on your host code; you may well get a performance gain simply by recompiling.
Appendix D Endnotes
1 Amiri, Hossein, and Asadollah Shahbahrami. “SIMD programming using Intel vector extensions.”
Journal of Parallel and Distributed Computing 135 (2020): 83–100.
2 Note if you are compiling just for one machine it makes sense to choose the highest option supported by
that machine. On the other hand, if you are compiling to distribute to a heterogeneous set of PCs it might
make more sense to turn this option off or choose the lowest option likely to be supported by all the
machines, maybe SSE2. See also our caveat about code making extensive use of the Intel intrinsic
SIMD functions where, counterintuitively, it might be best to turn this option off.
3 It turned out to be quite tricky to get the best result from either ICC or the VS compiler. For example, the
order of the two for loops in line 15 matters. For this problem, the VS compiler was consistently worse
than ICC.
Appendix E
Number Formats
This appendix explains how numbers (and other data) are actually held in computer storage.
In principle, you could write good code without knowing this. However, we strongly recommend that you look at this material because it will help you understand features such as the differences between unsigned versus signed integer variables, floating-point accuracy
and why boundary alignment matters.
All information in a computer is represented by the states of a set of switches. Each switch
can be either open or closed (but NOT in-between). The state of the switch is one bit of
information which represents the values 0 or 1. In practice, computer DRAM requires one
transistor1 and one capacitor to represent one bit of information. If the transistor is open
(conducting) or closed (non-conducting) then the voltage across the capacitor will be either
low or high, so for electrical reasons we might choose “open” to represent the 1 and “closed”
to represent 0.
A set of eight bits is called a byte and is the smallest addressable unit of computer
memory. Using the conventions of binary arithmetic, a single byte can represent integers in
the range 0–255 as illustrated in Figure E.1. Notice the bits within a byte are numbered from 0 to 7 from right to left; the most significant bit, worth 128 if set, is on the left.
For example, 5 + 7 = 12, or in binary

     0000 0101
   + 0000 0111
   = 0000 1100 = 12₁₀

where the subscript indicates that base 10 is being used.
Multiplication by 2 shifts the bits left by one position; if the topmost bit was 1 it is lost. This is an overflow error producing a mathematically incorrect result. Similarly, division by 2 shifts all the bits one position to the right and the rightmost bit is lost. This is an underflow event and means that integer division rounds down to the nearest integer. In general underflows are less serious than overflows, causing loss of precision rather than potentially catastrophic errors. Overflow can also happen using addition; an interesting case, using one byte of data, is that 255+1 = 0. What happens here is that all the bits are set in 255, so that adding 1 causes an arithmetic carry of 1 to ripple all the way left, clearing all the bits and then being lost as an overflow. As is often the case in computing, we turn the bug into a feature by noticing that since in normal arithmetic -1+1 = 0 we can decide that 255 represents -1 rather than +255, and indeed all values > 127 (i.e. those with the leftmost bit set to 1) are negative numbers. This is two's complement representation, where the leftmost bit is a sign bit. You can go from any integer n to -n by flipping the values of each bit and then adding one. In C++ we can declare integer variables to be either signed or unsigned depending on whether we want bigger positive integers or want to allow positive and negative values in our variable. Note this is more than cosmetic; the computer hardware will in fact execute different machine instructions for signed and unsigned variables. One example is that the test [0100 0000] > [1000 0000] is true or false depending on whether the variables concerned are char (true) or unsigned char (false).
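A small host-only sketch (not from the book's examples) illustrating this signed/unsigned comparison and the 255+1 wrap-around:

#include <cstdio>

int main()
{
    signed char   a = 0x40;    // bit pattern 0100 0000 =  64
    signed char   b = -128;    // bit pattern 1000 0000 = -128 as a signed value
    unsigned char c = 0x40;    // bit pattern 0100 0000 =  64
    unsigned char d = 0x80;    // bit pattern 1000 0000 = 128 as an unsigned value

    printf("signed:   %4d > %4d is %s\n", (int)a, (int)b, (a > b) ? "true" : "false");
    printf("unsigned: %4d > %4d is %s\n", (int)c, (int)d, (c > d) ? "true" : "false");

    unsigned char e = 255;     // all 8 bits set
    e = e + 1;                 // wraps around to 0, the 255+1 = 0 case above
    printf("255+1 as one byte = %d\n", (int)e);
    return 0;
}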
Large integers and floating-point numbers require more than one byte to hold their values.
In practice, this means 2, 4 or 8 bytes are used (notice always powers of 2). The Cþþ values
are shown in Table E.1. The internal representation of integers is straightforward; the bytes
on the left representing higher powers of two than the bytes on the right. Figure E.1 shows a
2-byte example.
In the figures the bit pattern can be written as AC05 in hexadecimal and, if interpreted as a 16-bit integer, corresponds to the value -21499₁₀ if signed or 44037₁₀ if unsigned. On modern computers integers can be represented by 1, 2, 4 or 8 bytes as shown in Table E.1. This table also includes details of the float and Boolean intrinsic types.
Table E.1

Name         Alternate name   Can be unsigned   Bits   Bytes   Values
bool         -                                  8      1       either true or false
char         -                ✓                 8      1       use for characters or short integers
short        short int        ✓                 16     2       integer [-2^15, +2^15] or [0, +2^16-1]
int          -                ✓                 32     4       integer [-2^31, +2^31] or [0, +2^32-1]
long         long int         ✓                 32     4/8     integer [-2^31, +2^31] or [0, +2^32-1]
long long    long long int    ✓                 64     8       integer [-2^63, +2^63] or [0, +2^64-1]
float        -                                  32     4       float, about ±3.4×10^38 (about 7 sig figs)
double       -                                  64     8       float, about ±1.7×10^308 (about 16 sig figs)
long double  -                                  64     8       80 bits possible on Intel CPUs
than Boolean values are tested in if statements. I have some sympathy for this style; it has
always seemed wasteful to me for a function to return a Boolean error code rather than a
potentially more useful numerical error code. In CUDA and my own code many functions
return zero to indicate success and other values as error codes.
The long int type is something of a C legacy feature; the C/C++ standards do not specify a unique value for the length of this type. Rather they specify that the length (and hence accuracy) of this type is at least the same as int. In early C implementations, int variables had a length of 2 bytes, the same as modern short, and long int variables had a length of 4 bytes. In modern implementations, int is now 4 bytes and the new type long long is 8 bytes, thus long int is arguably redundant. Another problem with long int is that it is likely to compile to 4 bytes in 32-bit code and 8 bytes in 64-bit code. I think it is best to make your intentions explicit and stick to int and long long.
denormalised (or unnormalised) floating point numbers. In this case, if some of their leading fraction bits are also zero the number represented has a value less than 2^-127 and is represented less accurately. For example, if bits 13–22 of the fraction are also zero then the value is the binary fraction represented by bits 0–12 multiplied by 2^-137. As we have seen previously, a denormalised number appearing in the calculations can have a catastrophic impact on the performance of host code and they should be automatically flushed to zero by the hardware using either a compiler or run time switch. For GPU code flush to zero is enabled as one of the --use_fast_math options.
Double precision 64-bit floats have a similar layout with 1 sign bit, 11 exponent bits and 52 fraction bits. Their accuracy is log10(2^54) = 16.256, i.e. about 16 significant figures. Integers up to 2^54 are held exactly in this format. The details are summarised in Table E.1. The use of double variables in a CUDA program always requires care; the cheaper cards can perform far fewer double precision operations per clock-cycle than float, and for all cards double the memory bandwidth is needed.
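If you want to inspect these limits on your own machine, the standard library makes them available; a small sketch (not from the book's examples):

#include <cstdio>
#include <limits>

int main()
{
    printf("float : %2d significand bits, %d decimal digits, min normal %g, min denormal %g\n",
           std::numeric_limits<float>::digits, std::numeric_limits<float>::digits10,
           (double)std::numeric_limits<float>::min(), (double)std::numeric_limits<float>::denorm_min());
    printf("double: %2d significand bits, %d decimal digits, min normal %g, min denormal %g\n",
           std::numeric_limits<double>::digits, std::numeric_limits<double>::digits10,
           std::numeric_limits<double>::min(), std::numeric_limits<double>::denorm_min());
    return 0;
}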
Recent releases of CUDA have introduced a 16-bit floating point type half and other
specialised variants for optimised tensor core calculations. These types are discussed in
Chapter 11.
Appendix E Endnotes
1 At the time of writing 1 GB of DRAM can be bought for £10.0, so you are getting ~10,000,000
transistors for one penny. These must be the cheapest objects ever manufactured.
2 At that time most non-IBM computers used paper tape having 7 useful bits per row and hence favoured
base 8 (or octal) notation. Fierce arguments raged; one story circulating at the time was that Greek
scholars favoured the term sexadecimal but that IBM were too timid to adopt it. An echo from this era is
found in the ASCII character encoding table, where the basic character set uses only 7-bits.
Appendix F
• The PTX ISA Application Guide: This is a detailed explanation of the individual PTX
operations and the associated ISA (instruction set architecture). Essentially it explains in detail
all the machine code operations exposed in CUDA. This is a reference manual, not light reading.
• PTX Interoperability: This is the guide you need to write GPU assembly level kernels
without the help of CUDA and NVCC. It explains the Application Binary Interface
(ABI) which allows your code to compile and interface correctly with the linker to run
on specific devices. This is a relatively short manual.
F.3 NVIDIA OptiX
Recent NVIDIA GPUs, specifically those based on the Turing TU102 and Ampere GA102
chipsets have new raytracing (RT) hardware in the form of one RT unit per SM. These units
are intended to accelerate the high-quality rendering of 3D scenes for gaming and other
visualisation purposes. CUDA itself does not directly expose the RT hardware to program-
mers but does support its use by means of function calls to objects in the additional OptiX
library. The OptiX SDK can be downloaded from https://developer.nvidia.com/designworks/optix/download. Documentation and example code are included in the download.
The RT units are designed to accelerate tests of if and where a ray meets an object's
bounding block. This is exactly the problem we discussed in Section 8.8 with CUDA code to
find intersections with an axis aligned bounding block in Example 8.21. It is likely that use
of RT cores could accelerate this step further. This is potentially useful both for the simulation of systems for detecting ionising radiation and for the forward and backward projection steps in tomographic reconstruction. As yet we have not explored this in detail but expect
applications to emerge soon. One caveat is that present GPUs have only one RT unit per SM
so the performance boost might be modest.
and can be run on GPU vectors. We use thrust as a container class throughout this book.
While distributed with the CUDA SDK, thrust is actually an independent project and more
details can be found at https://thrust.github.io/.
• CUB: This is another open-source project providing host level functions to perform
operations such as reduce. Its aims are similar to thrust but unlike thrust, CUB is not
included in the CUDA SDK. The short NVIDIA documentation essentially points to the
official website https://nvlabs.github.io/cub/.
• CUDA C++ Standard: This refers to the libcu++ project which aims to provide the kernel code equivalents to many of the C++ standard library (std) functions. It is a set of
include files which are included in the CUDA SDK. The website https://nvidia.github.io/
libcudacxx/ gives more details. We have not made use of these functions in this book.
The above list includes most of the guides and reference manuals you will need to develop your own GPU applications. However, NVIDIA provides much more documentation aimed
at a wider range of audiences from games developers to administrators of large scale HPC
facilities. Much of this can be accessed directly from the link at the start of this appendix.
Appendix G
The CX Header Files
Our examples use a collection of header files developed for our own use over a period of
time with some additions for this book. They contain many useful functions, either utilities
or wrappers to simplify some of the more complicated bits of CUDA (textures, for example).
Most of the material is in the cx (CUDA Examples) namespace. When used, this namespace
is explicit in the examples. A small number of commonly used definitions in the main cx.h
file are not in any namespace (e.g. uint); this helps to keep our code compact. The five header
files are shown in Table G.1 and complete listings are given. Obviously when compiling our
examples, the cx headers should be in the path used by the compiler.
Each of these header files is shown in Examples G.1–10 and discussed in detail in the
following sections.
G.1 The Header File cx.h
This base header file is needed by essentially all our examples. It defines short aliases for
many arithmetic data types for example “ullong” for “unsigned long long”. These
are used widely in our examples, in part to reduce verbosity. Other useful definitions are
wrappers for pointer types used as function arguments. A description of the code and a full
listing of the file follows.
02 #pragma once
. . .
11 // these for visual studio
12 #pragma warning( disable : 4244)// verbose thrust warnings
13 #pragma warning( disable : 4267)
14 #pragma warning( disable : 4996)// warnings: unsafe calls
15 #pragma warning( disable : 4838)// warnings: size_t to int
16 // macro #defines for min & max are usually bad news,
17 // the native CUDA versions compile to single instructions
18 #undef min
19 #undef max
Table G.1

File           Contents
cx.h           Basic definitions used by all examples, short type names and wrappers for pointers. cx::ok is defined here.
cxbinio.h      Simple C-style binary IO; we use these functions a lot for transferring data to and from disk.
cxtimers.h     A simple timer class with a very compact interface.
cxtextures.h   Wrappers to simplify the use of CUDA textures.
cxconfun.h     Just a bit of fun; defines a few maths functions as constexpr so that they can be evaluated at compile time. Used once in scanner.h in the PET chapter.
20 // cuda includes
21 #include "cuda_runtime.h"
22 #include "device_launch_parameters.h"
23 #include "helper_cuda.h"
24 #include "thrust/host_vector.h"
25 #include "thrust/device_vector.h"
26 #include "thrust/system/cuda/experimental/
pinned_allocator.h"
27 // C++ includes
28 #include <stdio.h>
29 #include <stdlib.h>
30 #include <string.h>
31 #define _USE_MATH_DEFINES
32 #include <math.h>
33 #include <algorithm>
34 #include <float.h> // for _controlfp_s
• Lines 12–15: Suppress certain irritating warning messages from Visual Studio C++.
• Lines 18–19 Remove #define style definitions of min and max. These do more harm than
good. We use either std versions in host code or the fast built-in CUDA intrinsics in
kernel code.
• Lines 21–26: Include most (but not all) of the commonly required CUDA and thrust
support files.
• Lines 28–34: Host C++ includes.
• Line 36: Defines a symbol for ASCII escape, used once in the Ising example.
• Lines 37–57: Lots of aliases for unsigned and const versions of native types. The use of the
prefix u for unsigned is common; the use of c for const is less common. These definitions
help keep function declarations compact.
• Lines 59–60: Similar aliases for two CUDA types; you can add more if necessary.
• Lines 62–69: Define four templated aliases for C++ pointers using the restrict keyword combined with const in all possible ways; a sketch of the assumed form of these aliases, together with an error-check wrapper in the style of cx::ok, follows this list.
Line 63: r_Ptr is defined without either the data or pointer constant. We often use this
with arrays intended for output.
Line 65: cr_Ptr is defined for constant data but the pointer can be variable. We often
use this for input array data.
Lines 67 and 69: Define cvr_Ptr and ccr_Ptr where the pointer itself is a constant
and the data is either variable or constant. These versions are not used in our code.
• Lines 72–73: Define templated aliases thrustHvec and thrustDvec for creation of
thrust host and device vectors and thrustHvecPin for host vectors in pinned memory.
These are often used in our code.
• Line 76: Most array container classes including std::vector and thrust::host_
vector have a member function data() which returns a raw pointer to the start of the
data array. This is occasionally useful for passing to functions which expect a pointer, one
example being the legacy fread and fwrite functions used for binary IO in cxbinio.h. However, if possible, passing a reference to the actual container object is better. In the case of kernel calls passing a reference is not allowed and a data pointer must be passed. There is an issue with thrust which is that, although thrust device vectors do have a data() member function, this function does not return a raw pointer for passing to kernels. If a is a thrust device vector then a.data().get() does return a suitable raw pointer. The function trDptr uses this to return a raw data pointer from a thrust device vector passed by reference. This function is intended to be used on the host.
• Line 78: The a.data().get() function is undocumented and the recommended
alternative is to use thrust::raw_pointer_cast() with either a.data() or
&a[0] as an argument. An alternative definition of trDptr is shown here using the
raw_pointer_cast. It is commented out in this version of cx.h but can be uncommented if you do not like using undocumented features.
In our examples we don’t make use of trDptr but prefer to use a.data().get()
directly as a kernel argument as this use matches our practice with host container classes. If
we were to use the raw_pointer_cast alternative version, the trDptr wrapper
defined here would significantly reduce the verbosity of the resulting kernel calls.
• Line 79: Everything else in this header and all other cx headers is inside the cx namespace.
• Lines 81–83: Define templated constexpr symbols pi, pi2 and piby2 for π, 2π and
π/2. Many scientific programs do something like this for π but using constexpr here
means that the evaluations are done with maximum precision at compile time. The default
template type is set to float so that cx::pi<> and cx::pi<double> can be used for
float or double values.
• Lines 84–88: The utility function tail returns the portion of the cstring s after the last
appearance of the delimiter character c. It is used in the function codecheck below to
strip the path from a complete file name.
• Lines 90–100: Here we define the macro cx::ok which can be used to check for
CUDA errors. It is closely based on the NVIDIA checkCudaErrors function in
helper_cuda.h. It is slightly less verbose and can be changed to return instead of
directly exiting the program.
Line 90: Declares the function codecheck which does all the work. The arguments
are code the cuda call to be carried out, file a cstring containing the fully qualified
name of the file containing the code, line a cstring containing the line number in the
code where the call to cx::ok occurs and call a cstring containing the CUDA call.
Line 92: Performs the CUDA function call code and checks the return code.
Lines 93–94: If an error occurs it prints an informative message and exits the program.
Note the use of cudaGetErrorString to convert the CUDA error code to a
meaningful text string.
Line 95: An alternative to directly exiting is to return with a non-zero error code here.
Line 100: The definition of cx::ok as a macro. We have used very few macros in our
code, but here the compiler adds a lot of extra value, by providing both the file name and
line number where cx::ok detected the error.
Line 102: An alternate definition of cx::ok which does nothing; this could be used in
debugged code to give slightly better performance.
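For reference, the aliases and the error check described above take roughly the following form. This is a sketch written for this discussion; the helper name codecheck_demo and the macro name CHECK_CUDA are illustrative, not the exact cx.h code, and the spelling of the restrict qualifier is compiler dependent (__restrict on MSVC, __restrict__ on gcc/clang):

#include "cuda_runtime.h"
#include "thrust/host_vector.h"
#include "thrust/device_vector.h"
#include <cstdio>
#include <cstdlib>

// assumed form of the restrict pointer aliases (lines 62-69)
template <typename T> using r_Ptr  = T*       __restrict__;   // variable data, used for output arrays
template <typename T> using cr_Ptr = const T* __restrict__;   // constant data, used for input arrays

// assumed form of the thrust container aliases (lines 72-73)
template <typename T> using thrustHvec = thrust::host_vector<T>;
template <typename T> using thrustDvec = thrust::device_vector<T>;

// an error-check wrapper in the spirit of cx::ok (illustrative, not the exact cx.h code)
inline cudaError_t codecheck_demo(cudaError_t code, const char *file, int line, const char *call)
{
    if(code != cudaSuccess) {
        printf("cuda error at %s:%d %s : %s\n", file, line, call, cudaGetErrorString(code));
        exit(1);                  // or return the error code instead of exiting
    }
    return code;
}
#define CHECK_CUDA(call) codecheck_demo((call), __FILE__, __LINE__, #call)

// usage:  CHECK_CUDA( cudaDeviceSynchronize() );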
G.2 The Header File cxbinio.h

Table G.2

template <typename T> int read_raw(const char *name, T *buf, size_t len, int verbose=1)
    Read len words from named file to buf.

template <typename T> int write_raw(const char *name, T *buf, size_t len, int verbose=1)
    Write len words to named file from buf. An existing file will be overwritten.

template <typename T> int append_raw(const char *name, T *buf, size_t len, int verbose=1)
    Append len words to named file from buf. A new file is created if necessary.

template <typename T> int read_raw_skip(const char *name, T *buf, size_t len, size_t skip, int verbose=0)
    Read len words from named file to buf starting after skip bytes.

template <typename T> size_t length_of(const char *name, int verbose=0)
    Returns length of named file in words of size T bytes.

int can_be_opened(const char *name)
    Returns 1 if named file can be opened, otherwise returns 0.
Another issue with binary files is that they might not be portable between computers of
different architectures. Specifically, moving between little-endian and big-endian machines will cause byte-swapping errors.
Some care is needed when opening named files for writing or appending. We use the parameters "wb" or "ab" in the open statements, which will cause any existing file of the same name to be overwritten (for "wb") or appended to (for "ab") without warning. If this is what you want, as is often the case when developing code, that is fine, but in other situations care must be taken. In our examples, command line
arguments are always arranged so the input file name is specified before the output file name.
Since C++11, it is possible to use "wbx", which causes an error when opening an existing
file. This could then be used to ask the user if they wanted to continue. Our function
can_be_opened can be used to check if a file exists before passing it to a cx read or
write function.
The functions are widely used in our examples but here we include another example
illustrating their use to merge a set of files. This example is followed by a detailed descrip-
tion of the cxbinio code.
01 #include "cx.h"
02 #include "cxbinio.h"
16 while(cx::can_be_opened(&name[0])){
17 if( cx::read_raw(&name[0],buf.data(),size,0)==0)
18 cx::append_raw(argv[2],buf.data(),size);
19 file++;  // next file in sequence
20 sprintf(&name[0],"%s%4.4d.raw",argv[1],file);
21 }
• Line 14: Create a vector name of type char to hold the name of the current input file. It will be just
big enough to hold the string generated by sprintf in lines 15 and 20. Note we need to include an extra character to hold the zero byte that terminates a C string. This approach is better than declaring
a fixed size array, for example, “char name[256];” and hoping the user does not input a very
large string.
• Line 15: Uses sprintf to create the full name for the first file and store the result in the C string
name. Note the use of %4.4d to create a zero padded value.
• Lines 16–21: This while loop processes each file in the sequence.
◦ Line 16: Checks if the current input file exists using cx::can_be_opened; the loop will
terminate if it does not.
◦ Line 17: Calls cx::read_raw to read the current file and check the return value for success. We
use the optional fourth argument to suppress the function’s “file read” message.
◦ Line 18: Calls cx::append_raw to append the contents of buf to the output file.
◦ Line 19: Increment file to point to the next file in the sequence.
◦ Line 20: Uses sprintf to update the file name in name ready for the next pass through the
while loop.
• Lines 22–23: Print a final message and exit.
Output from running the binio program on a set of 10 test files is shown in the
box below.
. . .
06 #include <stdio.h>
07 #include <stdlib.h>
08 #include <string.h>
. . .
20 namespace cx {
21 // read an existing file.
22 template <typename T> int read_raw(const char
*name, T *buf, size_t len, int verbose=1)
23 {
24 FILE *fin = fopen(name,"rb");
25 if(!fin) { printf("bad open on %s for read\n",
name); return 1; }
26 size_t check = fread(buf,sizeof(T), len, fin);
27 if(check != len) {
28 printf("bad read on %s got %zd items expected
%zd\n", name, check, len);
29 fclose(fin); return 1;
30 }
31 if(verbose)printf("file %s read\n", name);
32 fclose(fin);
33 return 0;
34 }
Description of cxbinio.h
• Lines 6–8: Standard include files.
• Line 20: Everything is inside the cx namespace.
• Line 22: Declaration of the read_raw template function. The arguments are name a cstring
containing the name of the file to be read, buf a pointer to an array of type T to receive the data, len the number of words to be read and verbose a flag to control printing. Note name
can include a path.
• Line 24: Open the file for binary read; the handle fin will be null if an error occurs (e.g. the file does not exist).
• Line 25: Print a message and return if the open was unsuccessful.
• Line 26: Perform the read operation using fread. Note two parameters specify the amount of data
to be read, the word length (sizeof(T) here) and the number of words (len here). The function
returns the number of words actually read.
• Lines 27–30: If the correct number of words has not been read, we print an error message, tidy up
and return with a non-zero error code.
• Lines 31–33: If the correct number of words has been read, then we optionally print the filename
and return with a zero error code.
• Lines 36–48: The function write_raw is almost identical to read_raw except that we open the file for binary write using "wb" instead of "rb" and call fwrite in line 40 instead of fread; a sketch based on this description is shown after this list.
• Lines 50-64: This is the code for the function append_raw. This function is almost identical to
write_raw() except that the output file is opened with “ab” instead of “wb” which means that
data is appended to the end of the file instead of overwriting any previous contents. If the file does
not already exist, it will be created as if “wb” had been specified.
• Lines 65–81: The code for the function read_raw_skip. This function is nearly the same as
read_raw except that read process starts after the first skip bytes of data. The parameter skip is
an additional input argument to this function. Note it is the user’s responsibility to ensure the skip
is a multiple of sizeof(T), otherwise memory alignment issues may occur.
• Lines 83–92: The function length_of which returns the length of the named file in words of size
determined by the template parameter T. The function is implemented by using fseek to move
the file position point to the end of the file and then using ftell to find the value of the pointer.
• Lines 94–100: The function can_be_opened does what you expect; it tries to open the named file
for read and if successful returns one, otherwise it returns zero.
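Based on the description above, write_raw looks roughly as follows. This is a sketch for orientation only (it belongs inside the cx namespace of cxbinio.h, which already includes the standard C IO headers); the repository listing is the definitive version:

// sketch of write_raw; append_raw would use "ab" instead of "wb"
template <typename T> int write_raw(const char *name, T *buf, size_t len, int verbose=1)
{
    FILE *fout = fopen(name, "wb");
    if(!fout) { printf("bad open on %s for write\n", name); return 1; }
    size_t check = fwrite(buf, sizeof(T), len, fout);   // word size and word count
    if(check != len) {
        printf("bad write on %s wrote %zd items expected %zd\n", name, check, len);
        fclose(fout); return 1;
    }
    if(verbose) printf("file %s written\n", name);
    fclose(fout);
    return 0;
}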
Before ending our discussion of binary IO it is worth mentioning that the ImageJ program discussed in Section 5.7 is a great tool for viewing the contents of binary files. One of its options is to replicate our Example G.4 and import a sequence of 2D images to a 3D stack. Such a 3D stack can then be viewed and manipulated in many ways.
. . .
16 // provides a MYTimer object for host based elapsed time measurements.
17 // The timer depends on the C++ <chrono>
18 // usage: lap_ms() returns interval since previous lap_ms(),
19 // start or reset.
21 #include <cstdio>
22 #include <cstdlib>
23 #include <chrono>
25 namespace cx {
26 class timer {
27 private:
28 std::chrono::time_point<std::chrono::high_resolution_clock> lap;
29 public:
30 timer(){ lap =
std::chrono::high_resolution_clock::now(); }
31 void start() { lap =
std::chrono::high_resolution_clock::now(); }
32 void reset() { lap =
std::chrono::high_resolution_clock::now(); }
34 double lap_ms()
35 {
36 auto old_lap = lap;
37 lap = std::chrono::high_resolution_clock::now();
38 std::chrono::duration<double,std::milli>
time_span = (lap - old_lap);
39 return (double)time_span.count();
40 }
42 double lap_us()
43 {
44 auto old_lap = lap;
45 lap = std::chrono::high_resolution_clock::now();
46 std::chrono::duration<double,std::micro>
time_span = (lap - old_lap);
47 return (double)time_span.count();
48 }
49 };
50 }
Description of cxtimers.h
• Line 23: The necessary header file <chrono> is included here.
• Line 26: Declaration of the timer class.
• Line 28: The single member variable lap is declared here. It is of a type suitable to hold a single
time measurement. This is intended to hold the start time for an interval measurement.
• Line 30: This is the default and only constructor. It sets lap to the current time. Thus, in simple use
cases the statement cx::timer tim; both creates the timer object tim and sets the start time for
an interval measurement.
• Lines 31–32: The two member functions start and reset both have the same effect of resetting
the start time for an interval measurement to the current time. In a later release they may have
different functions. These functions allow you to reuse a previously created timer.
• Lines 34–40: The member function lap_ms updates lap with the current time and returns the time
interval between the new and old values of lap. The time interval is returned as a double in units
of ms. Notice lap is reset to the current time by this call.
• Lines 42–49: This function is identical to lap_ms except the time interval is returned in units
of microseconds.
For measuring overlapping time intervals, for example, the total job duration and the times
of individual sections within the job, one can simply create multiple timer objects.
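A typical use of the timer is then simply the following; the loop body here is an arbitrary placeholder for the work being timed:

#include "cxtimers.h"
#include <cstdio>

int main()
{
    cx::timer tim;                                   // construction sets the start time
    double sum = 0.0;
    for(int i = 0; i < 100000000; i++) sum += 1.0;   // placeholder work to be timed
    double t1 = tim.lap_ms();                        // elapsed time in ms; also resets the start time
    printf("sum %.0f took %.3f ms\n", sum, t1);
    return 0;
}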
G.4 The Header File cxtextures.h
Texture data is held in special cudaArray objects allocated in device memory; neither cudaMalloc nor thrust can be used for this purpose. A total of five separate steps on the host are needed to create a texture object:
1. Allocate device memory using cudaMallocArray or cudaMalloc3DArray.
The array dimensions and type are specified by creating a special
cudaChannelFormatDesc structure and passing it as an argument to the allocation
function. Note that this is where we specify the array type as a template parameter (T in
our code) to the cudaCreateChannelDesc function. Allowed types are float, half
and 1,2 or 4-byte integers. Built in vectors of length 2 or 4 are also allowed, e.g. float2
and float4 but not float3.
2. Copy host data to the allocated device memory using either cudaMemcpyToArray (1D
or 2D) or cudaMemcpy3D (3D). The layout of your data after transfer to the device
may be different.
3. Create a cudaResourceDesc struct which holds the array pointer created in step 1.
4. Create a cudaTextureDesc struct which contains fields specifying various
optional properties of your texture. The possible flags are shown in Table G.3 with brief descriptions.
5. Declare an instance of a cudaTextureObject_t and set its properties by calling cudaCreateTextureObject with arguments that include the structs created in steps 3 and 4.
As indicated in the table there are several possible types of texture supported by the hardware
and CUDA. In addition to the standard texture type used in our examples there are also
layered textures and surfaces. A layered texture can be thought of as a stack of 1D or 2D
textures with each slice separately addressable. A layered 2D texture would be suitable for
processing a set of movie frames or MRI slices where each slice needed similar but separate
processing. A surface texture is similar to a standard texture but can be written to in kernel code. There is no attempt to maintain texture cache coherency between reads and writes during kernel execution, so in general any given kernel should only read or write to a surface.
The maximum sizes of standard textures are shown in Table 5.1 of Chapter 5. For a
device of CC ≥ 6 the maximum dimensions for 1D or 2D layered textures are 16384 or 16384 × 16384, with up to 2048 layers. Full details of the limits for other texture types and required CC levels are given in the Compute Capabilities section of the NVIDIA CUDA C++
Programming Guide.
Kernels read from standard textures using the templated functions tex1D, tex2D and
tex3D as discussed in Chapter 5. Other texture types use similar functions which are
explained in the CUDA Programming Guide.
The cxtextures.h header provides objects txs1D, txs2D and txs3D which create
standard textures and act as container classes for the cudaArray allocated in device
memory to hold the textures. To use these classes, you only need to provide a pointer to
the host array containing your data. The interface is similar for the three classes. To create a
texture, provide the constructor with the array dimensions as an int1, int2 or int3
variable and a list of option flags as shown in the box:
The member functions copyTo(T *data) and copyFrom(T *data) can be used
by mytexN to transfer data between the host and GPU texture memory. Also, the public
variables n and tex give access to the texture dimensions and the texture object itself. Thus
one can use mytexN.n and mytexN.tex as kernel arguments.
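As a usage sketch, creating a 2D texture might look like the following; the variable names and the option values chosen here are illustrative rather than taken from the book's examples, and the value 1 for normmode is assumed to select normalised coordinates:

#include "cx.h"
#include "cxtextures.h"

int main()
{
    int2 n = {512, 512};                      // texture dimensions
    thrustHvec<float> a(n.x*n.y);             // host data to load into the texture
    // ... fill a here ...

    cx::txs2D<float> atex(n, a.data(),        // dimensions and pointer to host data
        cudaFilterModeLinear,                 // filter mode
        cudaAddressModeClamp,                 // address mode
        cudaReadModeElementType,              // read mode
        1);                                   // normmode: 1 assumed to mean normalised coordinates

    // atex.tex and atex.n can now be passed to a kernel that calls tex2D<float>(atex.tex, x, y)
    return 0;
}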
A listing of cxtextures.h and descriptions follow.
. . .
06 #include "cuda_runtime.h"
07 #include "device_launch_parameters.h"
08 #include "helper_cuda.h"
09 #include <stdio.h>
10 #include <stdlib.h>
11 #include <string.h>
12 //=========================================================
13
14 // Provides the C++ classes txs1D, txs2D and txs3D which
15 // hold CUDA textures of the indicated dimension. These
16 // classes have a simple interface for creating textures.
17 // The user supplies data for the texture as a simple
18 // pointer to a host array. The class allocates and manages
19 // the device texture memory held in a texture object which
20 // can be passed to CUDA kernels. The user also specifies
21 // values for the five CUDA texture options:
// default constructor
57 txs2D(){ n ={0,0}; carray = nullptr; tex = 0; }
58 txs2D(int2 m, T *data, cudaTextureFilterMode filtermode,
        cudaTextureAddressMode addressmode,
        cudaTextureReadMode readmode,
        int normmode, int arrayType=cudaArrayDefault)
59 {
60 n = m; tex = 0; carray = nullptr;
61 cudaChannelFormatDesc cd = cudaCreateChannelDesc<T>();
62 cx::ok(cudaMallocArray(&carray,&cd,n.x,n.y,arrayType));
63 if(data != nullptr){
64 cx::ok(cudaMemcpyToArray(carray, 0, 0, data,
n.x*n.y*sizeof(T), cudaMemcpyHostToDevice));
65 }
88 if(carray != nullptr){
89 if(tex != 0) cx::ok(cudaDestroyTextureObject(tex));
90 cx::ok(cudaFreeArray(carray));
91 }
92 }
93 }; // end class txs2D
126 cx::ok(cudaMalloc3DArray(&carray,&cd,cx,arrayType));
127 if(data != nullptr)
copy3D(data, cudaMemcpyHostToDevice);
138 cx::ok(
cudaCreateTextureObject(&tex, &rd, &td, nullptr) );
139 }
140 void copyTo(T *data) { // copy from data to texture
141 if(data!=nullptr && carray!=nullptr)
copy3D(data,cudaMemcpyHostToDevice);
142 }
143 void copyFrom(T *data) { // copy to data from texture
144 if(data!=nullptr && carray!=nullptr)
copy3D(data,cudaMemcpyDeviceToHost);
145 }
146 ~txs3D() { // destructor does nothing if this instance is a copy
147 if(carray != nullptr) {
148 if(tex != 0) cx::ok(cudaDestroyTextureObject(tex));
149 cx::ok(cudaFreeArray(carray));
150 }
151 }
152 }; // end class txs3D
Description of cxtextures.h
Note this header file also contains the definition of the class txs1D which is almost identical to txs2D. In fact, CUDA implements a 1D texture of length nx as a 2D texture having dimensions nx × 1. Therefore, the class txs1D is omitted from the listings and discussion here. The full listing is of course in our code repository.
• Lines 51–97: Contain the definition of the txs2D template class. The class txs1D (not shown
here) is almost identical to txs2D except that n becomes an int1 variable and n.y is replaced
by 1.
• Lines 52–53: The single private variable carray is declared here. This is a pointer to the
cudaArray used to hold the texture. The caller does not need to access this array directly.
• Lines 54–56: Here the two public class variables are declared; n is an int2 variable that holds the
dimensions of the 2D texture and tex is the 2D texture object to be created and is suitable for
passing to GPU kernels as an argument.
• Line 57: This is the default class constructor; it is not intended to be used.
• Lines 58–79: This is the constructor used to create instances of txs2D objects.
• Line 58: The constructor arguments are as follows:
◦ int2 m: m.x and m.y: These are the dimensions of the 2D array.
◦ T *data: This is the pointer to a host array used as the source of the texture data. Note the
template parameter T for the data type.
◦ cudaTextureFilterMode filtermode: the required filter mode
◦ cudaTextureAddressMode addressmode: the required address mode
◦ cudaTextureReadMode readmode: the required read mode
◦ int normmode: the required normalisation mode
◦ int arrayType: the required array type. Note a default value of cudaArrayDefault is
supplied for this parameter so it can be omitted by the caller. This argument is intended for future
development and should not be changed.
• Line 60: The class variable n is set to m and the other class variables are cleared.
• Line 61: Declare and initialise the cudaChannelFormatDesc variable cd using
cudaCreateChannelDesc<T>(). The only parameter is T, but the apparent simplicity here
is misleading; CUDA supplies different objects using specialised instances of the function
cudaCreateChannelDesc for all valid choices of T. Code using early versions of the SDK
may be more verbose here.
• Line 62: Allocates a 2D CUDA array in GPU memory and stores a pointer to it in carray. Note we
have wrapped the allocation in cx::ok to get feedback on errors. This is done for most of the
critical CUDA calls in this header file.
• Lines 63–65: The host data is copied to the GPU here; changes to the layout in memory may occur.
• Lines 66–68: Make a cudaResourceDesc rd to hold the pointer carray.
• Lines 69–74: Make a cudaTextureDesc td to hold the texture mode options.
• Line 75: Create the texture object tex using rd and td.
• Line 78: Definition of a copy constructor. The copy will hold current copies of the public variables
but carray is set to null so that the copy’s destructor will not attempt to free resources.
• Lines 79–82: The member function copyTo which allows users to change the data stored in the texture; obviously this call is only used between kernel calls. Note the texture does not need to be recreated to do this.
• Lines 83–86: The member function copyFrom which allows data to be copied back from the
texture to the host. This is not normally useful for standard textures but would be useful for similar
code using surfaces.
• Lines 87–92: The class destructor which frees the resources allocated by the call. This automatic
freeing of resources is the reason container classes help simplify your code. Note copies of the class
made using the copy constructor will not free resources when they go out of scope. Thus there is a
tacit assumption that the parent instance of this class will go out of scope last. A copy count could be
used here if you wanted to be extra careful.
• Lines 101–152: These are the body of the txs3D class which provides support for 3D textures.
• Lines 101–106: The start of this class is the same as txs2D with the corresponding declarations of
the member variables carray, n and tex except that n is int3 rather than int2.
• Lines 108–109: The default and copy constructors are declared here.
• Lines 110–118: This is a new member function copy3D for copying data between the host and
GPU. For 3D textures GPU memory needs to be allocated using cudaMalloc3DArray rather
than cudaMallocArray. Transferring data to or from such 3D arrays is more verbose than just calling one of cudaMemcpyToArray or cudaMemcpyFromArray. For 3D we need
cudaMemcpy3D and this function takes a cudaMemcpy3DParms object as its one argument.
Since we need this operation several times we have written a separate function for this task.
◦ Line 110: The arguments are data, the pointer to the host data, and a cudaMemcpyKind flag
copykind indicating the direction of transfer. The class member variables carray and n are
also used.
◦ Line 112: Create and clear the cudaMemcpy3DParms object cp.
◦ Lines 113–116: Set the various fields in cp according to the CUDA recipe book. Both 2D and 3D
cudaArray types are actually allocated with the x-dimension rounded up to be a multiple of 512
bytes; the PitchedPtr created in line 113 allows for this. Notice n.z is not used here but is
used in line 115.
◦ Line 117: Call cudaMemcpy3D – the actual call is nicely concise.
• Lines 119–139: This is the constructor used to create instances of txs3D objects. The arguments are
identical to those of txs2D except the array dimensions m are specified with an int3 variable
instead of int2.
• Lines 123–124: These are the same as 60–61 for txs2D.
• Line 125: Here we create a cudaExtent object which holds the array dimensions in bytes. Such
objects are only used for the 3D case.
• Line 126: Here we allocate the array using cudaMalloc3DArray instead of
cudaMallocArray.
• Line 127: Here we call copy3D to transfer host data to the newly allocated GPU array. Note the
transfer direction flag is cudaMemcpyHostToDevice.
• Lines 128–137: Here we create resource and texture description objects; this is the same as before except that td has a third dimension specified in line 134.
• Line 138: Here we finally create the texture.
• Lines 140–150: The two data copy functions and class destructor are the same as before except that we use our copy3D function instead of directly calling CUDA transfer functions.
Elements in 2D textures and slices of 3D textures are laid out to optimise local 2D
addressing with small strides between elements. This is discussed next.
G.5 Morton Ordering
Figure G.2 Two 16 × 16 arrays are shown with each element's position in memory shown
in the boxes. Morton indexing is shown on the left and conventional row-major indexing is
shown on the right. In the Morton case the indices for nearest neighbours in x and y tend to
have similar values but when using conventional linear addressing the indices differ by 16.
In the Morton case increasingly larger index jumps occur at increasingly large power of
2 boundaries.
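For readers who want to experiment, a 2D Morton index can be computed by interleaving the bits of the x and y coordinates. The following sketch is not part of the cx code and assumes 16-bit coordinates:

// compute a 2D Morton index by bit interleaving (16-bit coordinates assumed)
unsigned int morton2D(unsigned int x, unsigned int y)
{
    unsigned int m = 0;
    for(int b = 0; b < 16; b++) {
        m |= ((x >> b) & 1u) << (2*b);      // bits of x go to the even bit positions
        m |= ((y >> b) & 1u) << (2*b + 1);  // bits of y go to the odd bit positions
    }
    return m;
}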
14 //=========================================================
15 // small set of constexpr math functions to be evaluated at
16 // compile time. Power series for sin and cos good for angles
17 // in [-2pi,2pi]. The tan function cuts off singularity at
18 // 10^9. We need to specify a fixed number of terms or
19 // iterations to keep compiler happy.
20 //============================================================
21 namespace cx {
// factorial n only works for n <= 12
22 constexpr int factorial_cx(int n)
23 {
24 int k = 1;
25 int f = k;
26 while(k <= n) f *= k++;
27 return f;
28 }
// sin(x) x in radians
29 constexpr double sin_cx(double x)
30 {
31 double s = x; int nit = 1;
32 double fnit = 1.0;
33 double term = x;
34 while(nit < 12) { // compile time evaluation
35 term *= -x * x / (2.0*fnit*(2.0*fnit + 1.0));
36 s += term;
37 nit++; fnit++;
38 }
39 return s;
40 }
// cos(x) x in radians
41 constexpr double cos_cx(double x)
42 {
43 double s = 1;
44 int nit = 1; double fnit = 1.0;
45 double term = 1.0;
46 while(nit < 12) { // compile time evaluation
47 term *= -x * x / (2.0*fnit*(2.0*fnit - 1.0));
48 s += term;
49 nit++; fnit++;
50 }
51 return (float)s;
52 }
// tan(x) x in radians
53 constexpr double tan_cx(double x)
54 {
55 double s = sin_cx(x);
56 double c = cos_cx(x);
57 double t = 0.0;
58 if(c > 1.0e-9 || c < -1.0e-09) t = s/c;
59 else if(c >= 0.0) t = s/1.0e-09;
60 else t = -s/1.0e-09;
61 return t;
62 }
// square root of abs(x)
63 constexpr double sqrt_cx(double x)
64 {
65 // return root of abs
66 if(x < 0) x = -x;
67 // NB sqrt(x) > x if x < 1
68 float step = (x >= 1.0) ? x/2.0 : 0.5;
69 float s = (x >= 1.0) ? x/2.0 : x;
70 int nit = 32; // explicit for compile time evaluation
71 while(nit >0) {
72 if(s*s > x) s -= step;
73 else s += step;
74 step *= 0.5;
75 nit--;
76 }
77 return s; sanet.st
78 }
79 }
80 // end file cxconfun.h
The code in these functions is straightforward. We use repeated multiplication for factorial
n. The integer data type used in lines 24 and 25 means the result will have overflow errors for
n > 12. This could be improved by changing the type to double which gives exact value up to
n=18 and 15 significant figures of accuracy thereafter. The unsigned long long data
type could also be used to extend the precise range slightly further.
The sin and cos functions sum the first 12 terms of the standard power series which gives
accurate results for small angles, ideally adjusted to be in the range [-π,π]. The tan function is
evaluated from the ratio of sin and cos with a cut-off near the singularity where cos is zero.
The square root function uses a binary chop iteration with 32 steps starting from a guess.
A different guess is used for the cases x <1 and x >1.
The examples in our book use the tan function once in the PET chapter. The other
functions are not used at all and we do not claim any of these functions are efficient.
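A typical compile-time use is shown below. This is a sketch rather than an example from the book; the assertion bounds are simply chosen to bracket cos(π/4) ≈ 0.7071, and the loop-based constexpr evaluation requires C++14 or later:

#include "cxconfun.h"

constexpr double pi4 = 3.14159265358979323846/4.0;
constexpr double c45 = cx::cos_cx(pi4);              // evaluated entirely at compile time
static_assert(c45 > 0.7071 && c45 < 0.7072, "cos(pi/4) check");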
Appendix H
AI and Python
One of the most important current applications of GPUs is the field of artificial intelligence
(AI) or more precisely machine learning (ML) or deep learning. In this field, a training data
set is used to train a neural network to perform some data analysis task, for example,
recognising cats in digital images. Once trained, the neural network can be used to process
new data. Such trained networks are in daily use now for a large and growing variety of
tasks. Recent NVIDIA GPU hardware innovations, including tensor cores, have been aimed
specifically at these applications.
AI developers for the most part use high-level programming tools, usually Python-based,
to link pre-existing modules to build neural networks of varying complexity and then train
and deploy them. The low-level tools themselves are doubtless written in Cþþ and use
carefully optimised CUDA kernels, but most users will not be exposed to those details. In
this appendix we list some of the rich library of AI tools available from NVIDIA for
AI developers.
Finally, we also mention some ways in which Python programs can write kernel code
directly, bypassing C++. It turns out that the resulting kernels are written in the same CUDA
as we use throughout this book; it is just that Python maintains the interface arrays. We think
our book is certainly useful for such developers.
• TensorRT
NVIDIA TensorRT is an SDK for high-performance deep learning inference. It
includes a deep learning inference optimiser and runtime that delivers low latency and
high throughput for deep learning inference applications. The core of NVIDIA TensorRT
is a C++ library that facilitates high-performance inference on NVIDIA GPUs. TensorRT
takes a trained network, which consists of a network definition and a set of trained
parameters, and produces a highly optimised runtime engine which performs inference
for that network.
• Triton Inference Server
The NVIDIA Triton Inference Server (formerly TensorRT Inference Server) provides a
cloud inferencing solution optimised for NVIDIA GPUs. The server provides an inference
service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for
any model being managed by the server.
• NCCL
The NVIDIA Collective Communications Library (NCCL) is a library of multi-GPU
collective communication primitives that are topology-aware and can be easily integrated
into applications. Collective communication algorithms employ many processors working
in concert to aggregate data. NCCL is not a full-blown parallel programming framework;
rather, it is a library focused on accelerating collective communication primitives.
• DALI
The NVIDIA Data Loading Library (DALI) is a collection of highly optimised building
blocks, and an execution engine, to accelerate the pre-processing of the input data for deep
learning applications. DALI provides both the performance and the flexibility for
accelerating different data pipelines as a single library. This single library can then be easily
integrated into different deep learning training and inference applications.
Appendix I
Topics in C++
This section is not intended to teach you C++; rather it highlights a few topics that we have
found to be useful in our, rather long, programming life. C++ is an enormous language and
continues to grow with each new revision. I expect readers to have some familiarity with the
basics of C (which is a subset of C++) and C++ but by no means expert knowledge. This
book is about writing code for scientific applications – that means processing data, simula-
tion of experiments and perhaps theoretical calculations. We only use a few of the advanced
features of modern C++; we do use some of the features introduced with C++11 but little
beyond that. We do not use OOP (object orientated programming) in any real sense; we have
a few classes with their own methods but there are no class hierarchies here. On the other
hand, we like template functions and some details like giving function arguments default
values. Our programming is mostly algorithmic and we do not think anyone with some basic
programming experience will find our code difficult to follow. Each piece of code that we
present is accompanied by a rather detailed line-by-line discussion.
I.1 Coding Style
We also like RAII (resource acquisition is initialisation). In practice, this means that our
variables are declared and initialised in a single statement whenever possible. Annoyingly,
CUDA SDK functions are often used to initialise variables passed as function arguments; in
these cases we try to put both the declaration and the initialisation on the same line of code:
Use:
float *a; cudaMalloc(&a, asize*sizeof(float));
rather than:
float *a;
cudaMalloc(&a, asize*sizeof(float));
For this reason, we also rather like the C/C++ ternary operator (? :), even though it can
make code look rather opaque until you get used to it. The interesting feature of the ternary
operator is that it returns a value and hence can be used anywhere a value is required, which is
not the case for an if/else statement. A good example of its use is setting variables from
user-settable command line parameters. This is shown in section I.1.3 below.
Try as we may, we are unable to warm to the C++ <iostream> classes. If you care
about the layout of results from a calculation and want fine control over the number of
significant figures, then these classes are simply too verbose. Instead, we use printf,
which is compact and gives good control over the final layout.
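A hedged comparison of the two approaches (the value and field widths are chosen arbitrarily):

#include <cstdio>
#include <iostream>
#include <iomanip>

int main() {
  double x = 3.14159265358979;
  printf("x = %10.6f\n", x);                       // width 10, 6 decimal places
  std::cout << "x = " << std::setw(10) << std::fixed
            << std::setprecision(6) << x << "\n";  // same layout, noticeably more verbose
  return 0;
}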
Command:
C:\ >prog1.exe infile outfile 256 128
Values in main:
argc = 5
argv[0] = prog1.exe argv[1] = infile argv[2] = outfile
argv[3] = 256 argv[4] = 128
The general form of the ternary operator is result = (test) ? val1 : val2; where test is a
logical test returning either Boolean true or false, and the expression evaluates to val1 if
the test is true or val2 if the test is false. As an example, we use the ternary operator to set a
parameter from either a command line option, if present, or a default if not; a sketch of this is
shown below.
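A minimal sketch consistent with this description (the variable name size is illustrative, and atoi from <cstdlib> is used to convert the argument):

int size = (argc > 4) ? atoi(argv[4]) : 256;  // 4th user-supplied value if given, else default 256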
This example shows how one might set an array size parameter from the fourth user-
supplied parameter on the command line, using a default value of 256. Most of our examples
have this style of user interface. One disadvantage of this simple method is that users have
to specify options in the correct order and have to give explicit values for all options
preceding any option whose default value is to be changed. A more robust approach is
obviously desirable for production code; the cxoptions.h code in our repository is
one possibility.
Function arguments in C and C++ can be passed in three ways; a short sketch illustrating all three follows this list.
1. By value – in this case, when the function is called, the compiler makes a copy of the
item being passed and passes that copy to the function. During execution a function may
change the value it received, but since the changes are made in a copy, the caller's version of
the item passed will not be changed when the called function returns. This method is good for
passing single numerical values but problematic for passing arrays, as the entire array
would need to be copied, which is expensive or impossible for large arrays.
2. As a pointer – in this case a pointer to the item in the caller's memory space is passed. If
the argument is an array a, the function parameter is declared as a pointer, for example
int *a, and the function can access the elements as a[index]. If the argument is a single
variable then it can be accessed as either *a or a[0]. If pointers are used, the function can
change the caller's version of the item by simply writing to it. This can be prevented if the
function declares the contents of the item as constant by using const int *a instead of
simply int *a in the function argument list.
3. As a reference – here the function receives a reference to the item in the caller's memory
space. If the argument is a container such as a std::vector, passed as say
std::vector<int> &a, its elements can be accessed as a[index] just like the pointer
case. If a is just a simple variable, passed as int &a, it can be read or written to as if it had
been passed by value, but any change will be seen by the caller because the caller's version
of a is directly accessed by the function. Changing the contents of a can be prevented by
declaring the argument const, for example const int &a. Passing by reference is better
than passing by value for large objects unless there is a specific reason for making a copy,
for example, the function makes changes which should not be seen by the caller.
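A minimal sketch of the three methods, with illustrative names:

#include <cstdio>
#include <vector>

void by_value(int n)                   { n += 1; }                             // caller's copy unchanged
void by_pointer(int *a, int size)      { for(int k=0; k<size; k++) a[k] += 1; } // caller sees changes
void by_reference(std::vector<int> &v) { for(auto &x : v) x += 1; }             // caller sees changes

int main() {
  int n = 5;                 by_value(n);        // n still 5
  int a[3] = {1, 2, 3};      by_pointer(a, 3);   // a now {2, 3, 4}
  std::vector<int> v(3, 7);  by_reference(v);    // v now {8, 8, 8}
  printf("n=%d a[0]=%d v[0]=%d\n", n, a[0], v[0]);  // prints n=5 a[0]=2 v[0]=8
  return 0;
}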
Note that in CUDA programs, the arguments passed by the host to a kernel function must be
either passed by value or passed as pointers to previously allocated GPU main memory. Items
passed by value are automatically copied to GPU memory; GPU constant memory space will
be used for these items if possible. Arguments cannot be passed by reference. A CUDA
kernel cannot return a value to the host either through its arguments or through a return value
(kernels must be declared void); results are returned by writing to GPU memory buffers which
the host then reads back. If you are using CUDA managed memory allocations then some of these
restrictions are implicitly lifted because the same physical memory might be used by both the
host and the GPU.
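A hedged sketch of these rules, with illustrative names and no error checking: size and scale are passed by value, data is a pointer to device memory allocated with cudaMalloc, and the result is read back by the host with cudaMemcpy:

#include <cstdio>
#include "cuda_runtime.h"

// scale and size are passed by value; data is a pointer to device memory
__global__ void scale_add(float *data, int size, float scale)
{
  int k = blockIdx.x*blockDim.x + threadIdx.x;
  if(k < size) data[k] = scale*data[k] + 1.0f;   // results written to the device buffer
}

int main() {
  int size = 1 << 20;
  float *dev_buf; cudaMalloc(&dev_buf, size*sizeof(float));
  cudaMemset(dev_buf, 0, size*sizeof(float));
  scale_add<<<(size+255)/256, 256>>>(dev_buf, size, 2.0f);  // size and 2.0f passed by value
  float first = 0.0f;
  cudaMemcpy(&first, dev_buf, sizeof(float), cudaMemcpyDeviceToHost); // host copies results back
  printf("first element %.1f\n", first);   // expect 1.0
  cudaFree(dev_buf);
  return 0;
}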
I.3 Container Classes and Iterators
To pass the contents of a thrust device vector A to a kernel we actually use the undocumented
member function A.data().get(), and we use this directly in most of our examples for
kernel arguments, thus saving the need to clutter code with otherwise unnecessary
pointer variables.
For numerical work we think the good old for loop cannot be bettered and, therefore, we
have used simple for loops throughout this book. However, C++ introduced the concept of
an iterator as a generalisation of the traditional integer loop counter. Our Example I.1 shows
three ways of looping over the elements of a vector in modern C++.
01 #include <stdio.h>
02 #include <stdlib.h>
03 #include <vector>
21 return 0;
22 }
D:\ >iter.exe
a[20] = 20
a[20] = 120
a[20] = 220
Notice the range-based loop in line 19 hides all the details of the order in which elements are
processed. This is just the sort of syntax we might use in a parallel program. Assigning
values using k++ in line 19 is arguably a potential bug; it will only work if the elements are in
fact accessed in their natural order – in fact the current C++ standard guarantees that a
range-based loop visits the elements sequentially in the same order as a standard for loop.
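A minimal sketch of the three looping styles, consistent with the output shown above (variable names and the default vector size are illustrative; the line numbering of the original example is not reproduced):

#include <stdio.h>
#include <stdlib.h>
#include <vector>

int main(int argc, char *argv[])
{
  int size = (argc > 1) ? atoi(argv[1]) : 100;
  std::vector<int> a(size);

  for(int k = 0; k < size; k++) a[k] = k;          // classic indexed for loop
  printf("a[20] = %d\n", a[20]);

  int k = 100;
  for(auto it = a.begin(); it != a.end(); ++it) *it = k++;   // iterator-based loop
  printf("a[20] = %d\n", a[20]);

  k = 200;
  for(auto &x : a) x = k++;                        // range-based for loop
  printf("a[20] = %d\n", a[20]);
  return 0;
}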
Container classes have other uses in C++: std::vector objects can contain any type,
including any built-in or user-defined class. The standard library contains other kinds of
container class such as maps, lists and queues; these are intended for non-numerical work,
and more information can be found online or in any good book on C++. We do not need any
of these features in this book.
Templates
We really like C++ templated functions; they are an elegant solution to the problem of
writing a set of functions that perform the same operation on different data types. Suppose
we want to write a function to calculate a saxpy-like linear combination a*X+Y, where X and
Y are one type and a is either the same type or a different type. The box shows two versions
of the function: one standard version for the case where X and Y are floats and a is an int,
and one templated version which works for all types for which the (possibly overloaded)
operators * and + are defined.
Standard function
float saxpy1(float x, float y, int a) {return a*x+y;}
e.g. float z = saxpy1(x,y,5); where x and y have type float.
Templated function
template <typename T, typename S> T saxpy2(T x, T y, S a)
{return a*x+y;}
e.g. float z = saxpy2(x,y,5); where x and y have type float.
or float3 z = saxpy2(x3,y3,5); where x3 and y3 have type float3.
The way template functions work is that whenever the compiler finds a call to saxpy2 in
your code it will examine the arguments and replace T with whatever common type x and y
have and replace S with the type of a throughout the function. Note that in this case the return
type of the function is specified as T. The compiler will generate a separate function for each
different combination of types encountered in the complete program. Often short template
functions will be automatically inlined. Sometimes the compiler needs help deciding
which types to use, and in that case you can specify the types explicitly, for example,
saxpy2<float3,int>(x3,y3,5). CUDA kernels and device functions can also be templated;
explicit template parameters must be supplied when launching a templated kernel whenever
they cannot be deduced from the kernel arguments.
Template parameters can either be types, preceded by the keyword typename (or class
for historic reasons), or integer values, preceded by an integer type such as int. Templated
integer values are useful in cases where the compiler needs to know their values, for example,
for fixed array dimensions or const values, and also in cases where a hint for unrolling for
loops or other optimisations is useful.
Templates can also be applied to class or struct definitions and to constexpr
definitions.
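A hedged sketch of a templated kernel (the kernel and variable names are illustrative): the typename parameter could be deduced from the arguments, but the integer template parameter must be given explicitly at launch:

#include <cstdio>
#include "cuda_runtime.h"

// T is the data type; UNROLL is an integer known at compile time
template <typename T, int UNROLL> __global__ void scale(T *data, int size, T a)
{
  // each thread handles UNROLL consecutive elements
  int start = (blockIdx.x*blockDim.x + threadIdx.x)*UNROLL;
  #pragma unroll
  for(int j = 0; j < UNROLL; j++) if(start+j < size) data[start+j] *= a;
}

int main() {
  const int size = 1000;
  float host[size];
  for(int k = 0; k < size; k++) host[k] = 1.0f;
  float *dev; cudaMalloc(&dev, size*sizeof(float));
  cudaMemcpy(dev, host, size*sizeof(float), cudaMemcpyHostToDevice);

  const int threads = 256;
  constexpr int unroll = 4;
  int blocks = (size + threads*unroll - 1)/(threads*unroll);
  scale<float,unroll><<<blocks,threads>>>(dev, size, 3.0f);  // explicit <float,4>

  cudaMemcpy(host, dev, size*sizeof(float), cudaMemcpyDeviceToHost);
  printf("host[0] = %.1f\n", host[0]);   // expect 3.0
  cudaFree(dev);
  return 0;
}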
I.4 Casts
In C/C++ a cast is used in an expression to change the type of the value held in an object, for
example, to convert a float value to an int value. Note that it is the value that is converted,
not the declared type of the object concerned. (Beware: Python is not the same; there the type
of a named variable can change.)
For arithmetic work this is conceptually straightforward, although care is required to avoid
unintended conversions. In C++ casts can also be used to change class objects, typically
moving them up and down class hierarchies – this can get really complicated but fortunately
we need none of it in our book. A simple example of float to int casting is shown in
the box:
float pi = 3.145;
int p;
(a) p = pi;
(b) p = (int)pi;
(c) p = int(pi);
(d) p = static_cast<int>(pi);
In line (a) the value in pi is implicitly rounded to an integer and the result is stored
in p. Rounding occurs towards zero; thus p will hold the value 3. The statement p = -pi;
would store the value -3 in p.
Line (b) has the same effect as (a) and would suppress any compiler warning. Very
importantly, it tells anyone reading the code that the programmer intended a conversion to
take place. We use this form of cast in numerical expressions throughout the book.
Line (c) is an alternative version of (b) available in C++; we prefer (b) because (c) is too
easily confused with function syntax.
Line (d) is the recommended C++ version for this case. Most books on C++ will tell you
that casting is undesirable and that the ugly syntax is intended to discourage the use of casts. This
may well be true when playing with classes, but it is certainly not true when crafting mixed
precision expressions to achieve maximum compute performance on CPUs or GPUs. We
unashamedly use C casts of type (b) throughout our code.
The C casting style will also work on pointers, but in C++ we need to use reinterpret_
cast instead of static_cast. In some cases, we do choose the verbose C++ version to
emphasise the slightly tricky nature of the code. Our vector-loading kernels are an example of this.
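A hedged sketch of this kind of cast (the kernel below is illustrative, not one of the book's vector-loading kernels): a float buffer allocated with cudaMalloc is reinterpreted as float4 so that each thread loads four values in a single transaction; the buffer length is assumed to be a multiple of 4:

#include <cstdio>
#include "cuda_runtime.h"

__global__ void sum4(const float * __restrict__ data, int size, float *result)
{
  int k = blockIdx.x*blockDim.x + threadIdx.x;
  // reinterpret the float buffer as float4 for vector loads (size assumed divisible by 4)
  const float4 *data4 = reinterpret_cast<const float4 *>(data);
  if(k < size/4) {
    float4 v = data4[k];
    atomicAdd(result, v.x + v.y + v.z + v.w);
  }
}

int main() {
  const int size = 1024;                // assumed to be a multiple of 4
  float host[size];
  for(int k = 0; k < size; k++) host[k] = 1.0f;
  float *dev, *dsum;
  cudaMalloc(&dev, size*sizeof(float));
  cudaMalloc(&dsum, sizeof(float));
  cudaMemcpy(dev, host, size*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemset(dsum, 0, sizeof(float));
  sum4<<<(size/4+255)/256, 256>>>(dev, size, dsum);
  float sum = 0.0f;
  cudaMemcpy(&sum, dsum, sizeof(float), cudaMemcpyDeviceToHost);
  printf("sum = %.1f\n", sum);          // expect 1024.0
  cudaFree(dev); cudaFree(dsum);
  return 0;
}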
C++ also has const_cast for changing the const nature of objects and dynamic_
cast for playing with classes. Neither of these is used in this book.
I.5 Cstrings
The C and C++ types char and unsigned char are 8-bit integer types on an equal
footing with the other 16, 32 and 64-bit integer types. With renewed interest in mixed
precision arithmetic, they have an important role to play.
Arrays of type char containing ASCII character codes have long been used to represent
character strings in C; such strings are terminated by the first zero character encountered
when reading the string sequentially from the start. In a bug-free world of benign users such
strings are perfectly acceptable. In the real world, the fragile termination convention used by
cstrings has been the source of many bugs and enabled hostile attacks. C++ introduced a
proper string class with much more robust management of string operations. Hence
cstrings are now rather deprecated. Nevertheless, we make some use of cstrings in our book,
mainly to handle command line arguments via the argc and argv variables.
One example where cstrings are useful is passing filenames to the cxbinio.h routine
cx::read_raw:
(a) cx::read_raw("indata.raw",inbuf,1000);
(b) cx::read_raw(argv[1],inbuf,1000);
Where read_raw is declared as
(c) int read_raw(const char *name, float *buf, int size);
In line (a): We call read_raw with an explicit string literal as the name of the file to be
read. Note that in both C and C++, string literals like this are cstrings.
Line (b): This is similar except that the cstring is contained in argv[1], an element of the
char *argv[] array holding the command line arguments.
Line (c): Shows the declaration of read_raw for data of type float; the first argument,
which receives the cstring, is declared as const char *name. The const qualifier is
mandatory in modern C++ and mitigates some of the dangers inherent in cstrings.
I.6 Const
The const keyword can be used in both C and C++ to qualify the type in a declaration; this
means that once initialised the declared object cannot be changed. A full discussion of all the
gory details surrounding const can be found in textbooks; here we briefly describe what is
done in this book.
Some C++ texts are very keen to emphasise const correctness in code, which roughly
means all items that are not changed in a function must be declared const. Our experiments
suggest that adding const to kernel or device function arguments does not yield any
performance gains. This is in contrast to the restrict keyword which can make a big
difference. In fact, we do try to use const with pointers to input data buffers, but we may
sometimes omit it for scalar parameters such as array dimensions. The use of const for
such parameters is to protect programmers from accidentally changing them; this is certainly
good practice but in simple cases, especially if these parameters are not passed to other
functions, the compiler can presumably tell that these parameters do not get changed and so
make the same optimisations it would have done if they had been declared const.
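A hedged sketch of this style (the kernel name is illustrative): const documents that the input buffer is read-only, while __restrict__ is the qualifier that gives the compiler licence to optimise:

// input buffer marked const and __restrict__, output buffer __restrict__ only
__global__ void smooth(const float * __restrict__ in, float * __restrict__ out, int size)
{
  int k = blockIdx.x*blockDim.x + threadIdx.x;
  // in is read-only (const) and is promised not to alias out (__restrict__)
  if(k > 0 && k < size-1) out[k] = 0.25f*(in[k-1] + in[k+1]) + 0.5f*in[k];
}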
The use of const to protect code from accidental side-effects when calling functions is
much more important in large projects with multiple programmers contributing to the code.
I.7 Max/Min
These simple functions are widely used, particularly in numerical code. Curiously, in early
versions of C++, they were not available as intrinsic functions. Linux programmers using
gcc relied on the built-in macro definitions:
Either this
#define max(a, b) ( a > b) ? a : b
#define min(a, b) ( a < b) ? a : b
Or this:
#define max(a, b) (((a) > (b)) ? (a) : (b))
#define min(a, b) (((a) < (b)) ? (a) : (b))
The macros were not automatically supplied on Windows platforms using Visual Studio,
leading to portability issues. The upshot was that macros like this proliferated across many
software packages. There are several problems with these macros; firstly they involve
branching and thus might not be the most efficient way of implementing these functions
on some architectures. This applies to NVIDIA GPUs which can perform these operations in
a single instruction. The second problem is that macros are too powerful; they are expanded
by the preprocessor and thus, if defined, they prevent any superior intrinsic versions of these
functions from being used. Thirdly, they do not always work if embedded in complex expressions
or if expressions are used for arguments (which may then be evaluated more than once); the
second form in the box, with extra brackets, is an attempt to solve this, but even that does not
always work.
Modern C++ provides the max and min functions as part of the standard library so we
use std::min and std::max in our host code to be sure of getting the best versions. For
device code, CUDA supplies generic max and min functions that work for all standard
arithmetic types using the appropriate built-in functions. If you are worried they might be
overwritten by lurking macros you can use fmaxf and fminf for a pair of floats, fmax and
fmin for a pair of doubles and umax and umin for unsigned ints.
Note our cx.h include file undefines the symbols max and min to help protect your code
from these macros but depending on what other include files you use and in what order they
are included, these macros might get redefined.
More generally, macros are deprecated in modern C++; they can usually be replaced by
some combination of the C++ using statement for types, constexpr for values and
lambda functions for expressions.
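A hedged sketch of such replacements (names are illustrative):

#include <cstdio>
int main() {
  // instead of #define macros for a type, a value and a function-like expression:
  using valtype = float;                         // replaces a type macro
  constexpr valtype scale = 2.5f;                // replaces a value macro
  auto square = [](valtype x){ return x*x; };    // replaces a function-like macro
  printf("%.2f\n", square(scale));               // prints 6.25
  return 0;
}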