A.2.1 Loop Unrolling.
Loop unrolling is an optimization technique in which the body of a loop is explicitly repeated multiple times [
270]. Unrolling of a loop can be done manually, by explicitly duplicating the body of the loop, or automatically, either using macros, C++ templates [
7], or compiler directives [
294]. Figure
20 shows an example of manual loop unrolling for a simple snippet of code.
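As a concrete illustration (a minimal sketch with a hypothetical accumulation loop, not the code of Figure 20), the CUDA C++ fragment below shows the same loop in its original form, manually unrolled by a factor of four, and unrolled through a compiler directive:

// Original loop: one branch and one index update per element.
__device__ float sum_rolled(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

// Manually unrolled by a factor of 4 (n is assumed to be a multiple of 4):
// fewer loop-overhead instructions and four independent accumulators that
// expose instruction-level parallelism.
__device__ float sum_unrolled4(const float *x, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

// The same effect through a compiler directive supported by nvcc.
__device__ float sum_pragma(const float *x, int n) {
    float s = 0.0f;
    #pragma unroll 4
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

Note that the manually unrolled variant reassociates the floating-point additions, which may change the result slightly.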
Loop unrolling can have various effects that benefit performance. One clear benefit is a reduction in the overhead of loop-related instructions, such as branches and address calculations [
242,
270]. Another benefit is the increase in opportunity for ILP [
242,
270,
282,
345], due to offering more independent instructions to hide latencies. Finally, a common effect of loop unrolling is that it enables further compiler optimizations, such as replacing array indices with compile-time constants so that the array elements can be stored in registers [
67,
216,
320], removal of branches which depend on the loop variable [
112,
318,
439], and vectorization of instructions [
54,
67,
309].
Overall, loop unrolling has been found to be beneficial in many different scenarios. For example, Ryoo et al. [
318] optimize tiled matrix multiplication for the 8800GTX and find that the best performance is achieved with 16 \(\times\) 16 tiles and complete unrolling of the innermost loop. This gave a performance increase of up to 25%, although choosing the right tile size had a much larger effect. Perhaps even more interesting is that loop unrolling had no effect for a tile size of 8 \(\times\) 8, most likely because register usage limits the number of threads. Van Werkhoven et al. [390] show how loop unrolling can increase the performance of convolution operations on GPUs. Qiu et al. [
308] unroll the innermost loop of a hash function since it removes a large switch statement. Fortin et al. [
112] accelerate Lefèvre’s algorithm and unroll a loop by a factor of two since it removes a conditional that only holds every other iteration. Reddy et al. [
311] completely unroll a reduction tree since it reduces synchronization between threads. Lin et al. [
216] unroll a loop so that a small array is no longer dynamically indexed, allowing the compiler to store its elements in registers and resulting in a 71% improvement in performance.
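The effect that Lin et al. exploit can be illustrated with the following sketch (a hypothetical filter, not their kernel): once the loop is fully unrolled, every index into the small per-thread array is a compile-time constant, so the compiler can keep the array in registers instead of local memory. For a trip count this small the compiler may already unroll automatically; the sketch only illustrates the mechanism.

#define TAPS 4

// Dynamically indexed per-thread array: without unrolling, acc[i] may be
// placed in slow local memory because i is not a compile-time constant.
__device__ float filter_rolled(const float *w, const float *x) {
    float acc[TAPS];
    for (int i = 0; i < TAPS; i++) acc[i] = w[i] * x[i];
    float s = 0.0f;
    for (int i = 0; i < TAPS; i++) s += acc[i];
    return s;
}

// Fully unrolled: acc[0]..acc[3] are statically indexed and can be held in registers.
__device__ float filter_unrolled(const float *w, const float *x) {
    float acc[TAPS];
    #pragma unroll
    for (int i = 0; i < TAPS; i++) acc[i] = w[i] * x[i];
    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < TAPS; i++) s += acc[i];
    return s;
}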
However, although unrolling loops is often beneficial, unrolling by a factor that is too large can hurt performance. One reason is that loop unrolling increases the code size, and performance can deteriorate when the unrolled loop no longer fits in the instruction cache [
270]. Another reason is that unrolling loops can increase pressure on register usage which in turn reduces occupancy [
105,
169,
270,
319], although some authors claim to see a decrease in register usage after unrolling [
320]. The factor by which to unroll a loop is thus an important consideration and various researchers have focused on finding the ideal factor [
270,
287,
309], often concluding that it differs per kernel and device [
287].
A.2.2 Reduce Branch-divergence.
Branch divergence, sometimes also called path divergence, is a problem for SIMD or warp-based architectures such as GPUs, where a warp executes instructions in lock-step. If a branch depends on the thread index, different threads in a warp may have to take different paths. On a GPU this results in all threads taking both branches, where the threads that should not take a branch do not write their results. This hurts performance because both branches are executed serially. If the instructions within the branch are simple enough, the branches can be replaced with predicated instructions in which only the threads with a true predicate write back the result of the instructions [
52,
112].
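A minimal sketch of what this looks like at the source level (hypothetical kernels; whether the compiler actually predicates is its own decision, and it often does so for short branches by itself):

// Divergent branch: threads in the same warp may take different paths.
__global__ void scale_branch(float *y, const float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f)
        y[i] = 2.0f * x[i];
    else
        y[i] = 0.5f * x[i];
}

// The same computation as a conditional select: both operands are cheap,
// so the compiler can emit predicated instructions instead of a branch.
__global__ void scale_select(float *y, const float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i];
    y[i] = (v > 0.0f) ? 2.0f * v : 0.5f * v;
}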
Removing branches. A first technique is to remove branches altogether, since branching contributes to overhead [23, 97]. In some cases, branches can be replaced by maximum and minimum functions [59, 231] or by other arithmetic operations [177, 358, 379].
Figure
21 shows an example from Wang et al. [
379]. The original code with branches is shown at the top; at the bottom, the branches are replaced by arithmetic operations. A shaded box represents an instruction of an inactive thread within the warp (a thread for which the condition is false), meaning that the instruction is executed but its result is not written back.
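In the same spirit, but not the exact code of Figure 21, the sketch below removes a data-dependent branch (a hypothetical clamping operation) once with a maximum function and once with an arithmetic mask:

// With a branch: clamp negative values to zero (divergent when signs are mixed).
__device__ float clamp_branch(float v) {
    if (v < 0.0f) v = 0.0f;
    return v;
}

// Branch-free alternatives: a maximum function or an arithmetic mask.
__device__ float clamp_max(float v) { return fmaxf(v, 0.0f); }

__device__ float clamp_mask(float v) {
    float mask = (float)(v > 0.0f);   // 1.0f where the condition holds, 0.0f otherwise
    return mask * v;
}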
To give an indication of the potential and limitations of branch-divergence reduction, Wang et al. [
379] observe a speedup of 3.45 for a branch-intensive kernel on an AMD GPU from 2008 using the Brook+ programming language, and no speedup from this optimization for other kernels. Tran et al. [
358] report speedups around 10% on a Kepler GPU (2012) with a Lattice Boltzmann solver where branches are avoided on the boundaries of the 3D stencil computation.
Chakroun et al. [
68] replace branches with sine, cosine, and exponentiation functions (using fast math functions, see Section
A.3.2), something they call branch refactoring. Vespa et al. take this technique to an extreme in what they call algorithm flattening, where each branch is expressed as a computation that is reduced and optimized further [
369].
Jiang et al. perform “instruction stream normalization” [
172], which eliminates branches as well. They store conditions in a 4-bit-wide feature vector that indexes into a lookup table in constant memory, which provides the thread with the right value. Daga et al. [
80] remove branches with kernel fission (Section
A.2.4), splitting kernels based on conditionals that can be computed on the CPU before kernel launch. They state that this optimization affected AMD GPUs more in their case since these had fewer branching units per processing core than NVIDIA GPUs.
Reducing branches. If branches cannot be removed, the next best strategy is to reduce branching. Cecilia et al. [
64] apply Ant Colony Optimization to the Traveling Salesman Problem and notify other threads in the block via shared memory whether a city has been visited. They later improve on this by using warp instructions such as shuffle [
65]. Several articles use loop unrolling to reduce control flow in a kernel [
112,
308,
311,
439]. In loops, it is also possible to adopt a different technique. Han and Abdelrahman introduce iteration-delaying [
140], which delays iterations of a loop with a branch inside such that each iteration only takes the path that the majority of the threads wants to take. The other path is taken in a later iteration, once it has the majority of the threads. They use warp-voting functions to determine the majority. The authors note that the minority threads for a certain path may be stalled for a long time. Novak solves this by allowing only stalled threads to take part in the vote, trading off some efficiency for the progress of threads [
286]. Zhang et al. use a somewhat similar method that checks the condition at the end of the loop, allowing threads without work to skip an iteration [
431]. Another loop-based method, introduced by Han et al., merges nested loops to allow threads in a warp to start the next iteration while other threads execute code from the inner loop [
141].
Trading off branching. Another strategy is to trade off divergence against another metric. Okada et al. reduce branching in gene sequence alignment at the cost of more redundant work, because threads in a warp then take a similar path [
296]. In their case, more redundant work is less costly than branching. Sartori and Kumar introduce branch-herding [
324]. They apply this technique to kernels that can tolerate errors: warp-voting instructions determine which path the majority of the threads takes, and all threads in the warp are then forced onto this path. Users can control the error rate by means of performance counters, which allow setting a threshold on the number of forced branches. They apply similar techniques for memory coalescing.
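A minimal sketch of the branch-herding idea (the kernel and the two paths are hypothetical stand-ins): a warp vote determines which path the majority of the active lanes wants, and all lanes are forced onto that path, accepting an error for the minority lanes.

__global__ void herd_branch(float *y, const float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bool cond = x[i] > 0.0f;
    unsigned active = __activemask();
    unsigned votes  = __ballot_sync(active, cond);
    // Herd all lanes onto the majority path; minority lanes compute the
    // "wrong" branch, an error that the application is assumed to tolerate.
    bool herded = 2 * __popc(votes) >= __popc(active);
    y[i] = herded ? sqrtf(fabsf(x[i])) : x[i] * x[i];
}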
Hong et al. propose a new programming method for dynamic applications such as graph processing that discriminates between SIMD execution and traditional, sequential SISD execution [
154]. In the latter mode, each thread in a warp executes the same instruction on the same data, which reduces the parallelism within a warp but eliminates branch divergence. By decomposing warps into smaller virtual warps, they can trade off branch divergence against serial execution within a warp.
Reducing the effect of branches. Instead of reducing branching itself, it is also possible to reduce the effect of branching. Han and Abdelrahman introduce branch-distribution [
140], which aims to reduce the code inside branches as much as possible, moving out code that does not need to be inside them, often at the cost of introducing new registers to store temporary results [
112,
140,
178]. If the instructions in the branching code are simple enough, the compiler may be able to predicate the instructions, masking threads that should not write a result. Lalami and El-Baz use a similar technique by only employing registers in the branches [
198].
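A minimal sketch of branch distribution (the expression is hypothetical): the expensive part that both paths share is hoisted out of the branch and only the operand that differs is selected inside it, at the cost of a temporary register.

// Before: the expensive expf() appears in both paths of the branch.
__device__ float before_distribution(float a, float b, bool cond) {
    float r;
    if (cond) r = expf(a) * a + 1.0f;
    else      r = expf(a) * b + 1.0f;
    return r;
}

// After branch distribution: only the differing operand remains inside the
// branch, held in an extra temporary register; expf() is executed once,
// outside the divergent region.
__device__ float after_distribution(float a, float b, bool cond) {
    float t;
    if (cond) t = a;
    else      t = b;
    return expf(a) * t + 1.0f;
}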
Thread/data remapping. The above techniques target branches directly, but it is also possible to affect branching indirectly. The following paragraphs discuss the parallelization scheme and the data layout, focusing on more static applications first.
In stencil computations, data is often loaded into shared memory including a border. Instead of using branches to load the border data, some authors also assign threads to the border; these threads do not participate in the later computation stage [
206,
251,
262,
402]. Itu et al. apply a different strategy and do not load the left and right border into shared memory but instead load these values directly from global memory [
168]. Yet another approach for stencils is to assign all the divergent load operations across all the warps to a single warp to speed up the others [
316].
Phuong et al. study several optimizations for reductions on GPUs, the first of which is avoiding branch divergence [305]. They achieve this with strided accesses and evaluate the optimization on GPUs of different generations, noting that the newer GTX 970 benefits more from it than older ones such as the GTX 560 Ti.
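The difference can be made concrete with the following shared-memory reduction sketch (a generic textbook-style example, not the code of [305]; both kernels assume that blockDim.x is a power of two). The interleaved variant uses a modulo condition that diverges within every warp, while the strided variant keeps the active threads contiguous so that entire warps retire early.

// Interleaved addressing: the condition diverges within every warp.
__global__ void reduce_interleaved(float *out, const float *in) {
    extern __shared__ float s[];
    unsigned tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Strided addressing: active threads stay contiguous, so warps are either
// fully active or fully idle, avoiding divergence within a warp.
__global__ void reduce_strided(float *out, const float *in) {
    extern __shared__ float s[];
    unsigned tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned stride = 1; stride < blockDim.x; stride *= 2) {
        unsigned idx = 2 * stride * tid;
        if (idx < blockDim.x)
            s[idx] += s[idx + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}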
There are more articles that change the parallelization scheme to reduce branch divergence. Chang et al. present a dynamic tiling solution for a high-performance tridiagonal solver [
69]. Honda et al. employ warp-synchronous programming to perform high-precision integer multiplication (1024 bits). They assign the multiplications to threads such that branch divergence is mostly avoided [
150]. Yamashita et al. propose new thread assignments to the data of optimal polygon triangulation [
405]. With these new thread assignments, they eliminate branch divergence. Hu et al. reduce branch divergence in a graph application by processing an edge per warp instead of an edge per thread [
160].
Zhang et al. investigate how to reduce branch divergence in highly dynamic, irregular applications where static analysis does not apply [
426]. They propose thread-data remapping, which reduces branch divergence through two mechanisms: data layout transformation and reference redirection. The former creates a reordered copy of the data such that thread divergence is minimized; the latter adds an extra indirection to memory accesses with the same goal. Since these techniques incur overhead, especially the data layout transformation, they apply pipelining to overlap kernel execution with the layout changes, which take place on the CPU. In addition, they propose an algorithm to minimize the use of data layout transformation by replacing it with reference redirection. In later work [
427], the authors extend this into the G-Streamline framework, which also targets irregular memory accesses and introduces an additional technique, job swapping, that essentially reorders the thread indices. Lin et al. present a source-to-source compiler solution that applies thread-data remapping on the GPU instead of on the CPU [
219]. Later work also supports nested branches [
218].
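A minimal sketch of the reference-redirection idea, much simpler than the actual mechanisms of [426] (the kernel and the remap array are hypothetical): thread i does not process element i but element remap[i], where elements with the same branch outcome have been placed next to each other, so warps tend to agree on the branch.

__global__ void process_remapped(float *y, const float *x,
                                 const int *remap, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int j = remap[i];          // indices grouped by branch outcome, built on the CPU
    if (x[j] > 0.0f)           // warps now mostly agree on this condition
        y[j] = sqrtf(x[j]);
    else
        y[j] = 0.0f;
}

Building the remapping, for example by sorting the element indices by branch outcome, is exactly the overhead that the authors hide by pipelining it on the CPU.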
An often-used term in combination with warp divergence is thread block compaction, a micro-architectural improvement introduced by Fung and Aamodt [114]. This term, or the more general term stream compaction, should not be confused with stream compaction as a parallel algorithm for filtering items, itself also a target of optimizations on GPUs [42, 327]. In the context of warp divergence, it means that threads taking a specific path are grouped so that they execute together.
Besides the work of Zhang et al., the work by Khorasani et al. can also be regarded as belonging to this class of solutions. They propose Collaborative Context Collection, which targets loops with branches [
183]. Threads in a warp use shared memory to store a context that describes which branch would be taken. If enough contexts for a specific branch are recorded, the threads in a warp choose a context to execute, ensuring that there is no branch divergence.
Gmys et al. use several techniques to reduce the amount of branching code [
121]. They apply stream compaction by first computing which similar work should be assigned to each thread, and then launching a reduced number of threads to perform the operations. In addition, for code in which they cannot prevent a high degree of branching, they launch kernels with only one thread per warp, similar to the virtual warps that Hong et al. [
154] introduced.
Related to compaction is sorting the data such that a similar workload is assigned to the same threads in a warp or block. Burtscher et al. present a Barnes-Hut n-body application. They sort the bodies to group spatially close bodies together. This grouping reduces branch divergence because force computation is initially the same for close-by bodies [
59]. Nasre et al. apply a similar technique to sort triangles in Delaunay Mesh Refinement [
276]. Others use similar techniques [
68,
175,
301]. Kovac et al. predict the number of iterations that a thread has to make and sort the tasks based on this prediction [189]. Arumugam et al. use a heuristic based on the previous iteration of their algorithm to map threads to similar tasks [33]. Djenouri et al. also sort the input based on size and distribute it equally over the thread blocks [90]. However, since this distribution may still result in thread divergence, they apply statistical techniques to minimize its occurrence.
Change the data layout. In a sense, sorting the data already changes the data layout, but it is strongly tied to the parallelization. A solution that changes the data layout more purely is padding [
64,
396,
435]. Sitaridi et al. search strings in parallel for a certain first match [332]. They first group the strings by length and then split the strings over multiple phases to give each thread in a warp an equally sized string.
For sparse computations, such as sparse matrix-vector multiplication or graph algorithms, many techniques are used to reduce branch divergence as well. Several researchers introduce sparse formats that reduce branch divergence [
109,
242,
342,
352,
366,
386]. Some formats are specifically designed with the goal of reducing branch divergence: Ashari et al. introduce the Blocked Row-Column format [
34], Khorasani et al. propose a graph representation format called G-shards [
186] and other graph formats are designed with reducing branch divergence in mind as well [
109,
253,
266]. Sha et al. decode the adjacency list of a graph in two steps, reducing branch divergence [
328].
Algorithmic changes. Finally, there are also techniques that are strongly tied to the algorithm they execute: with their framework RegTT [428], Zhang et al. improve on AutoRopes [122]; both frameworks perform multiple queries in parallel while traversing trees. RegTT improves on AutoRopes by employing breadth-first search up to a certain depth before switching to depth-first search, which allows it to reorder the queries by common paths. This gives threads in a warp a high chance of reduced branch divergence.
A.2.3 Sparse Matrix Format.
GPUs are commonly used for sparse linear algebra operations, most notably SpMV. The memory format used to represent a sparse matrix has an important influence on performance, and an abundance of different formats has been proposed. This section discusses some of the most prominent contributions in sparse formats designed for GPUs. We refer to Bell and Garland [
46] for an overview of the essentials and to Muhammed et al. [
266] for a recent review. Figure
22 shows three common formats from Bell and Garland [46]. The ELL format has padding, which is marked with *.
It is difficult to give an indication of performance benefits because sparse matrix-vector multiplication is a highly irregular application where performance relies heavily on the input. Muhammed et al. [
266] give a good overview with mean and aggregate throughput over various inputs. Concretely, for their SURAA sparse matrix format, they report an average speedup over a range of inputs between 1.1 and 40, depending on the competing format, on a Kepler GPU (2012). For specific inputs, they report speedups between 1.94 and 562, but also slowdowns compared to competing formats.
The
ELL format [
46] (also called
ELLPACK) is commonly used for matrices where the
number of non-zeros (
nnz) is roughly consistent across all rows, which is common for matrices obtained from structured grids or semi-structured meshes. In this format, a sparse matrix is condensed into a dense matrix by removing the zeros, shifting the non-zero entries to the left, and padding each row with zeros until it contains exactly
K entries in each row. This results in a format which is highly regular and suitable for GPUs, at the cost of adding padding. Several improvements have been proposed to overcome this padding overhead. For instance,
ELL-R [
365,
366] adds an additional array storing the actual length of each row without padding. This allows threads to skip the padding zeros and avoids performing useless work, albeit at the cost of storing an additional array.
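A sketch of SpMV over the ELL format described above, assuming the common column-major layout so that consecutive threads access consecutive addresses; the padded entries of Figure 22 are stored as zero values with an arbitrary valid column index.

__global__ void spmv_ell(float *y, const float *val, const int *col,
                         const float *x, int num_rows, int K) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;
    float dot = 0.0f;
    for (int k = 0; k < K; k++) {
        int idx = k * num_rows + row;   // column-major ELL storage
        dot += val[idx] * x[col[idx]];  // padded entries contribute 0
    }
    y[row] = dot;
}

In ELL-R, the kernel would additionally receive a per-row length array and iterate only up to row_len[row], skipping the padded entries.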
Sliced ELL [
264] divides the matrix into strips of
C adjacent rows and stores each strip in ELL format. The value of
K differs in each strip, thus reducing the overhead of padding for the overall matrix.
ELLR-T [
100] combines these two ideas: it stores each strip in ELL format, like sliced-ELL, and also adds an additional array storing the row lengths, like ELL-R. SELL-C-
\(\sigma\) [
192] improves upon sliced-ELL in two ways: it ensures that C is a multiple of the SIMD width, to allow vectorization, and it sorts every \(\sigma\) consecutive rows by row length, putting rows of similar length into the same slice. Kreutzer et al. [
192] perform an extensive analysis of the parameters
C and
\(\sigma\) to tune for the best performance on different architectures.
Instead of modifying an existing format, another direction of research is to combine different sparse formats. For example, Bell and Garland [
46] propose a hybrid format called
HYB which stores the first
K non-zeros of each row in ELL format and any excess non-zeros in COO format. This benefits from the regularity of the ELL format without suffering from excessive padding. Yang et al. [
413] combine ELL and CSR: rows of the matrix are grouped and each group is stored in either ELL or CSR format based on a simple heuristic. Su et al. [339] take a more rigorous approach by presenting the cocktail format. In this format, the matrix is partitioned and each submatrix is stored in one of nine different existing formats. They present a framework that can recommend the best partitioning of any given sparse matrix for certain architectures based on offline training data.
Sparse matrix operations usually have a low arithmetic intensity and are memory-bound [
403]. Compression of the format can help to increase throughput at the cost of additional computations. For instance, Xu et al. [
403] use index compression to store the column indices in 16 bits instead of 32 bits, although this only works for certain matrices. Tang et al. [
352] propose a
bit-representation optimized (
BRO) compression scheme, which consists of delta encoding followed by bit packing. They apply their scheme to three existing formats and see an average speedup of
\(1.5\times\). Yan et al. [
409] propose a format called
blocked compressed common coordinate (
BCCOO), which relies on bit flags to compress the row index array. This essentially achieves a compression ratio of 32 (from 32-bit integers to 1 bit per index), but at the cost of decoding the indices at runtime.
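To illustrate the kind of compression involved, the host-side sketch below delta-encodes the sorted column indices of one row and bit-packs the deltas with the minimal width that fits them; this only illustrates delta encoding plus bit packing and is not the actual layout of [352] or [409].

#include <algorithm>
#include <cstdint>
#include <vector>

struct CompressedRow {
    uint32_t first;                 // first column index, stored in full
    uint32_t bits;                  // bit width used per delta
    std::vector<uint32_t> packed;   // deltas, bit-packed into 32-bit words
};

// Assumes the column indices of the row are sorted and deltas fit in 31 bits.
CompressedRow compress_row(const std::vector<uint32_t> &cols) {
    CompressedRow r{cols.empty() ? 0u : cols[0], 1, {}};
    uint32_t max_delta = 0;
    for (size_t i = 1; i < cols.size(); i++)
        max_delta = std::max(max_delta, cols[i] - cols[i - 1]);
    while (r.bits < 31 && (1u << r.bits) <= max_delta)
        r.bits++;                   // smallest width that fits every delta
    uint64_t buf = 0;
    uint32_t used = 0;
    for (size_t i = 1; i < cols.size(); i++) {
        buf |= (uint64_t)(cols[i] - cols[i - 1]) << used;
        used += r.bits;
        if (used >= 32) {           // flush a full 32-bit word
            r.packed.push_back((uint32_t)buf);
            buf >>= 32;
            used -= 32;
        }
    }
    if (used > 0) r.packed.push_back((uint32_t)buf);
    return r;
}

A GPU kernel would perform the inverse transformation, trading extra decoding instructions for reduced memory traffic.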
The above formats assume that matrices are static, but there are also dynamic formats, often in the context of graph processing where sparse matrices represent graphs, that allow addition and removal of edges at runtime. Méndez-Lojo et al. [253] present a dynamic format based on wide sparse bit vectors, which they use to concurrently apply graph rewrite rules. Busato et al. [
61] present
Hornet, which represents the matrix through a hierarchical data structure that relies on vectorized bit trees.
A.2.4 Kernel Fission.
Kernel fission is either the process of dividing a single kernel into multiple ones, or of splitting a single kernel iteration into multiple iterations. It can be seen as the opposite of kernel fusion (Section
A.1.8). This optimization improves performance through better resource utilization, achieved with simpler and more regular kernels.
A classic example of the use of kernel fission can be found in the work by Reddy et al. [
311], where a parallel reduction is split into two consecutive kernel executions: the first computes multiple partial reductions in parallel and the second combines the partial results into the final value. In a similar way, this optimization is used in the work by Satish et al. [
325] to overcome the lack of inter-block synchronization (Section
A.3.10) on the GPU.
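A sketch of this two-launch structure (generic code, not the kernels of [311] or [325]): the first launch writes one partial sum per block, and a second launch of the same kernel over the partial sums produces the final value, so no synchronization between blocks is needed inside a kernel. The sketch assumes blockDim.x is a power of two and that the first launch uses at most 256 blocks.

// Each block reduces its part of the input to a single partial sum.
__global__ void block_sums(float *out, const float *in, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Host side: two consecutive launches replace one kernel with a global barrier.
void reduce(float *d_result, float *d_partial, const float *d_in, int n) {
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    block_sums<<<blocks, threads, threads * sizeof(float)>>>(d_partial, d_in, n);
    block_sums<<<1, threads, threads * sizeof(float)>>>(d_result, d_partial, blocks);
}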
In the context of cryptography, kernel fission is often used to simplify the structure of complex kernels [
23,
52]. Particularly interesting is the fact that An et al. [
23] decided to execute one of those kernels on the CPU, thereby combining this optimization with the CPU/GPU computation technique (Section
A.4.2). Kernel fission has also been used as a means to
reduce branch divergence [
63,
80] (see Section
A.2.2), by having different kernels for different branches and letting the host execute the correct kernel depending on run-time conditions. However, this solution can be complex in practice: in the use case provided by Daga et al. [80], a single kernel is replaced by 16 different ones, one for each possible execution path. Nevertheless, with this technique they obtain a speedup of 70% for a molecular modeling application on an AMD Cypress GPU (2009), an architecture with few branching units.
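A minimal sketch of the mechanism (two hypothetical specialized kernels standing in for the execution paths of [63, 80]): the condition is evaluated on the host before the launch, so neither kernel contains the branch.

__global__ void path_scale(float *y, const float *x, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = f * x[i];
}

__global__ void path_sqrt(float *y, const float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = sqrtf(x[i]);
}

// The branch lives on the host: the condition is known before kernel launch.
void launch(float *d_y, const float *d_x, int n, bool take_sqrt, float f) {
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    if (take_sqrt) path_sqrt<<<blocks, threads>>>(d_y, d_x, n);
    else           path_scale<<<blocks, threads>>>(d_y, d_x, n, f);
}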
Improving regularity is something for which kernel fission has been extensively used. Multiple examples can be provided for sparse matrix-vector multiplication [
35,
81,
89], where splitting a single kernel into multiple ones can help in exploiting local regularities. Even in the case of
dense matrix-vector multiplication, this optimization can be used to enable processing of matrices of arbitrary dimensions [
8]; in this article, Abdelfattah et al. use a different kernel to process the bottom rows of the matrix, instead of implementing corner cases in the main kernel.
Also worth mentioning is a connection between kernel fission and
auto-tuning (see Section
A.3.6). In fact, among the reasons Alimirzazadeh et al. [
19,
88,
112] have to split their original kernels into smaller and simpler ones is that these smaller kernels can be optimized more easily, and their run-time configurations tuned better.