1 Introduction

The collection-oriented approach to data-parallel programming, where computations are expressed in terms of higher-order functions—such as maps, folds, and scans—over (multi-dimensional) arrays, provides a powerful and convenient programming model. By allowing programmers to identify the parallelism of an algorithm explicitly, yet in an abstract, architecture-independent way, languages such as Futhark [7, 12, 13, 14], Manticore [9], Lift [23], and Accelerate [3, 6, 17, 18] have demonstrated that it is possible to achieve performance comparable to hand-optimised, low-level code on a range of concrete hardware architectures, such as GPUs and multi-core CPUs.

These languages are typically restricted to regular data-parallelism: they support executing nested loops in parallel only if the inner loop bounds are independent of the indices of the outer loops, thereby limiting the kinds of parallel algorithms which can be conveniently expressed. While it is possible to transform any irregular data-parallel computation into one containing only regular data-parallelism  [2], doing so efficiently in practice has still, in general, not been achieved. Instead, techniques to add limited support for irregular computations to regular data-parallel languages—without compromising performance—are used  [6, 7]. This work is a further step in this direction. We discuss our approach in the context of the language Accelerate  [3], but it applies to any similarly structured language. The main contributions of the paper are:

  1.

    An extension of the regular data-parallel programming model to allow for the parallel execution of array-level conditionals and iterations over irregular nested structures, enabling a larger class of problems using irregular nested parallelism to be executed efficiently (Sect. 3)

  2.

    A shape analysis to detect at compile time when shapes of nested arrays are equal (Sect. 4)

  3.

    A program analysis to identify the regular subcomputations of an irregular program (Sect. 4)

  4.

    Benchmarks demonstrating the effect of the optimisations enabled by the shape equality and regularity detection analyses (Sect. 5)

We defer the discussion of related work to Sect. 6.

2 Background

In this section, we give an overview of Accelerate as a representative of the collection-oriented programming model. We discuss the difference between regular and irregular data parallelism, and why the expressiveness of the latter significantly complicates the mapping of the high-level operations to efficient code.

2.1 Accelerate’s Programming Model

The collective operations with which we express parallel computations in Accelerate are based on the scan-vector model [4, 22], and consist of multi-dimensional array variants of familiar Haskell list operations such as map and fold, as well as array-specific operations such as index permutations. In this paper we use a slightly simplified syntax for Accelerate programs for the sake of readability: Accelerate is deeply embedded in Haskell, so the types of expressions are wrapped in a type constructor, which we omit here, as we do with class constraints. For example, to compute the dot product of two arrays, we write:

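The listing itself is not reproduced in this extraction; in standard (unabridged) Accelerate syntax, the dot product is conventionally written as follows.

```haskell
import qualified Data.Array.Accelerate as A

-- Multiply the two vectors element-wise, then reduce with (+).
dotp :: A.Acc (A.Vector Float) -> A.Acc (A.Vector Float) -> A.Acc (A.Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)
```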

Accelerate is rank-polymorphic, meaning that operations work on arrays of arbitrary rank, or dimensionality. The rank of an array is encoded in the type, denoted by a subscript in our example code. The dot product function consumes two multi-dimensional arrays of rank \(n+1\) and produces an array of rank n as output, by folding along the innermost dimension. The type of fold is:

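The simplified signature is not reproduced here; in the Accelerate library itself, fold has the following type, where the paper's simplified syntax omits the Acc and Exp wrappers and the class constraints.

```haskell
-- exported by Data.Array.Accelerate
fold :: (Shape sh, Elt e)
     => (Exp e -> Exp e -> Exp e)       -- associative combination function
     -> Exp e                           -- initial element
     -> Acc (Array (sh :. Int) e)       -- input array of rank n+1
     -> Acc (Array sh e)                -- result of rank n
```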

There are two sources of parallelism here: the actual reduction can be done in parallel in \(\log n\) steps using a tree fold, and for arrays of rank two and higher, we can do all the tree folds in parallel. In Accelerate, both sources of parallelism are exploited. This is a limited form of nested parallelism—regular nested parallelism—where the size of the inner parallel loop is the same for all iterations.

The generate operation is a parallel loop construct which takes as input a shape descriptor specifying the extent of the resulting array, and a function that will be applied to each index of that shape to compute the value at that index. In the regular data-parallel model, the operation passed to generate is restricted to a sequential function returning a single scalar value:

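The corresponding signature in the Accelerate library is shown below; again, the simplified syntax used in this paper drops the Acc and Exp wrappers.

```haskell
-- exported by Data.Array.Accelerate
generate :: (Shape sh, Elt e)
         => Exp sh                -- extent of the result
         -> (Exp sh -> Exp e)     -- scalar function applied at every index
         -> Acc (Array sh e)
```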

All parallel operations so far have the property that the extent of the output array is independent of the values of the array elements. Unfortunately, there are useful operations for which this is not the case. For example, consider the function filter, which removes the elements of an array which do not satisfy a given predicate. A rank-polymorphic filter, where the output array has the same nesting depth as the input array, requires that the shape of the innermost nesting level is ragged. We therefore use a nested array type which records, for an array of a given nesting depth, how many of the outermost dimensions are guaranteed to be regular; the remaining inner dimensions are potentially irregular. The type of the filter operation is expressed in terms of this nested array type.


We also have segmented versions of parallel operations, which take irregular arrays as input. For example, the segmented fold reduces each of the innermost segments of an irregular array in parallel.

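Its type in the paper's nested-array notation is not reproduced here; for comparison, stock Accelerate provides a flat segmented reduction, which takes a separate vector of segment lengths, with roughly the following signature.

```haskell
-- exported by Data.Array.Accelerate (constraints abbreviated)
foldSeg :: (Shape sh, Elt e)
        => (Exp e -> Exp e -> Exp e)      -- combination function
        -> Exp e                          -- initial element
        -> Acc (Array (sh :. Int) e)      -- flat data, innermost dimension segmented
        -> Acc (Segments Int)             -- length of each innermost segment
        -> Acc (Array (sh :. Int) e)      -- one result per segment
```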

Apart from collection-oriented operations, we have an array-level conditional operator cond, and an iteration construct while, which repeatedly applies a function to an initial array for as long as a predicate applied to the current iteration value holds. For example, assume we have a (potentially) parallel implementation of the inner loop of Bubblesort, and a function which checks whether an array is sorted; then we can write:

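The listing is not reproduced here. A minimal sketch of the loop structure in stock Accelerate might look as follows; bubblePass is a hypothetical placeholder for the parallel Bubblesort step (here a no-op so that the snippet type-checks), and isSorted tests that a vector is non-decreasing.

```haskell
import qualified Data.Array.Accelerate as A

-- Hypothetical stand-in for one parallel pass of Bubblesort; a real
-- implementation would reorder adjacent out-of-order pairs.
bubblePass :: A.Acc (A.Vector Float) -> A.Acc (A.Vector Float)
bubblePass = id

-- A vector is sorted when every element is <= its successor.
isSorted :: A.Acc (A.Vector Float) -> A.Acc (A.Scalar Bool)
isSorted xs = A.and (A.zipWith (A.<=) xs (A.tail xs))

-- Iterate the pass until the array is sorted.
sort :: A.Acc (A.Vector Float) -> A.Acc (A.Vector Float)
sort = A.awhile (A.map A.not . isSorted) bubblePass
```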

In this example, the size of the resulting array is the same as the size of the input array, independent of the number of iterations, as a single Bubblesort pass does not change the size of its input. In general, however, this is not the case: an iteration may change the size of the array it operates on. Similarly, the two branches of a conditional do not need to evaluate to arrays of the same shape.

We extend the regular data-parallel programming model by allowing both cond and while to occur inside regular nested parallel loops. This generalisation means that a previously regular parallel loop may now give rise to irregular nested parallelism, but it affords the programmer the flexibility to express a larger range of applications. The techniques we present in this paper are aimed at minimising the costs of irregular parallelism arising from these constructs.

Continuing our previous example, if we want to apply the parallel bubble sort program in parallel to a collection of arrays, we can now write:

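The listing is not reproduced here; in the paper's extended notation (not stock Accelerate, since the body of the generate is itself an array computation), it might read roughly as follows, with sort the while-based sort sketched above.

```
-- Sketch in the paper's extended notation: the body of the outer generate is
-- an array-valued (parallel) computation, which the extension permits.
sortAll xss = generate (extent xss) (\i -> sort (xss ! i))
```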

where extent returns the outer regular shape of an array, which in this example is the number of inner arrays n. Since we know that a Bubblesort pass leaves the size of its input unchanged, we also know that the while-loop will return an array with the same shape as its input. In particular, if the input array happens to be regular, then the output array will also be regular. The aim of our shape analysis (Sect. 4.2) is exactly to check whether shapes stay the same. Our regularity detection (Sect. 4.3) can then use this information to find regular subcomputations.

3 Preserving Regularity

Flat arrays of primitive type are, for the majority of parallel architectures, the most efficient representation, and in the case of GPUs, actually the only structure which is supported. Therefore, we need to represent the nested arrays of our source language by flat data arrays, with the shape information stored separately.

For regular nested arrays of rank n, that is not a problem: they can be represented efficiently by storing the elements in a flat data vector in row-major order, together with the size of each dimension as an array of integer values of length n. For example, consider the two arrays shown in Fig. 1. The former is regular, and its shape can be represented compactly, whereas for the latter we have to store the size of each segment, which can incur a significant memory overhead, especially if there are many small or even empty segments. Operations like indexing into the array are also more expensive for irregular representations. To index into the third subarray in Fig. 1, we have to calculate the sum of the sizes of all the preceding subarrays in the irregular case, whereas in the regular case, calculating the offset is a simple multiplication.

Fig. 1. Representation and indexing for regular and irregular arrays

Programs manipulating irregular arrays are therefore more expensive, both in terms of the additional memory required to store the size of each segment on every level, and in the extra processing required to maintain and manipulate the segment descriptor. This is exactly why we want to use the regular array representation whenever possible.
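A minimal sketch of the two index calculations described above, in plain Haskell: the regular case needs only the fixed row width, whereas the irregular case must sum the sizes of all preceding segments.

```haskell
-- Offset of element j in subarray i of the flat data vector.

-- Regular representation: every row has the same width m.
regularOffset :: Int -> Int -> Int -> Int
regularOffset m i j = i * m + j

-- Irregular representation: 'segs' stores the length of each segment.
irregularOffset :: [Int] -> Int -> Int -> Int
irregularOffset segs i j = sum (take i segs) + j
```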

3.1 Statically Determining Regularity

Clifton-Everest et al. [6] added irregular arrays to Accelerate with support for a single level of nested parallelism. In that formulation, the regularity of an expression is evident from the types of the operators used; for example, fold removes the innermost dimension while preserving the regularity of the outer \(n-1\) dimensions.

We extend that work by adding support for regularity preservation in the presence of the nested control-flow operators cond and while.

Let us go back to our example (see Fig. 2), and replace the Bubblesort pass with a function which returns an array twice the size of its input. If we do not know how many iterations the while-loop performs on each subarray, we cannot ensure that the resulting nested array is regular. However, if we know that the number of iterations depends only on the size of the input array, then we again know that regularity is preserved, as the termination condition returns the same value for each row of a regular array, even if we do not statically know its exact shape.

Fig. 2. The parallel while operation preserves regularity if (1) the iteration is applied the same number of times to all subarrays or (2) the iteration function does not alter the shape of its input.

The parallel application of the cond operation preserves regularity when, for example, either all conditionals take the same branch, or the subarrays of the true and false branches have the same shape. In the first case we end up with one of the two regular branches, and the result thus stays regular. In the second case, subcomputations may take different branches, but the output has the same shape as both branches and therefore remains regular.

Both parallel while-loops and conditionals occur in many applications, so it is worthwhile to detect the cases where their use preserves the regularity of their inputs, and to use regular code and array representations in these cases. The next section formalises the analyses we use to detect patterns such as those mentioned above, where regularity is preserved.

4 Program Analyses

The goal of our two program analyses is to identify, at compile time, the regular (sub-)computations of the program, so that the more efficient regular data-parallel operations and array representation can be used for those computations. Our analysis consists of two parts: a shape analysis, which determines when two array shapes are equivalent, and regularity detection, which generalises vectorisation avoidance [16] and identifies sub-expressions that are independent of their surrounding parallel context, using the shape information. We discuss these analyses below in the context of a small nested data-parallel core language.

4.1 Core Language

Listing 1 gives the grammar of the core language which we use to describe our analyses. This language is a generalisation of the core language of Clifton-Everest et al. [6], allowing for arbitrary nesting depth, with the addition of the control-flow constructs cond and while.

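Listing 1 itself is not reproduced here. As a rough guide, the grammar described in the next paragraph could be rendered as the following Haskell data type; the constructor names and the representation of constants are our own.

```haskell
type Var  = String
type Dims = [Int]                   -- exact dimensions of an array constant

data Value = VInt Int | VFloat Float | VArray [Value]

-- Functions appearing in operator position (the body of a generate, the
-- combination function of a fold, the predicate of a while) are lambdas.
data Fun = Lam [Var] Exp

data Exp
  = Var Var                 -- variables v
  | Const Dims Value        -- scalar and array constants c
  | Index Exp Exp           -- array indexing (!)
  | Prim String [Exp]       -- application of primitive operators p
  | Let Var Exp Exp         -- let-expressions
  | Tuple [Exp]             -- tuples
  | Proj Int Exp            -- projection from a tuple
  | Extent Exp              -- outer regular shape of an array
  | Generate Exp Fun        -- data-parallel loop over a regular shape
  | Fold Fun Exp Exp        -- tree-reduction over the innermost dimension
  | Cond Exp Exp Exp        -- array-level conditional
  | While Fun Fun Exp       -- iteration: predicate, step function, initial value
```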

Expressions e consist of variables v; scalar and array constants c (the subscript \(_{(l, \dots , l)}\) indicates the exact dimensions); array indexing !; application of primitive operators p; let-expressions; as well as tuples and projections from tuples \(\pi _l\), where l is the index of an element in the tuple. The operator extent returns the outer regular shape of an array; generate constructs an array of the given regular shape by applying a function to every index of that shape in data-parallel; fold performs a parallel tree-reduction over the innermost dimension of an array using the supplied binary function and initial element; and cond and while are the conditional and iteration constructs described in Sect. 2.1. Irregular nested arrays are introduced as array constants, or constructed via generate, which—in contrast to Accelerate—does not limit the result type of the generator function to scalar values.

4.2 Shape Analysis

Before we formalise our shape analysis, first consider the following example, which illustrates a common pattern we wish to detect:

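The term itself is not reproduced here; reconstructing it from the shape derivation later in this subsection, it has (up to the choice of the inner index name) the following form in the core language.

```
-- yss adds one to every element of xss via two nested generates
yss = generate (extent xss)
        (\sh -> generate (extent (xss ! sh))
                  (\i  -> xss ! sh ! i + 1))
```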

This term uses nested applications of generate to add one to every element of the array xss, binding the result to yss. The goal of the analysis is to determine that the shapes of xss and yss are, in fact, identical.

We use the shape analysis in Sect. 4.3 to identify regular subcomputations, but it can be used to enable other optimisations, such as preventing redundant recomputation of segment descriptors, array recycling, and identifying opportunities to use destructive updates.

Formalisation. Our shape analysis proceeds by first building an abstract shape descriptor for every array in the program, and then simplifying these descriptors so that they can be compared for equivalence. We write \(ns_1 \eqcirc ns_2\) to denote shape equivalence. Note that this comparison is not exact: since we do not have all information available at compile time, the equivalence test is necessarily conservative; if two shape descriptors are found to be equal, their associated arrays will definitely have the same shape at runtime, but the reverse is not necessarily true.

Our shape descriptors are constructed using the following grammar:

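The grammar itself is not shown; as with the core language, one possible Haskell rendering of the descriptors described in the next paragraph is the following sketch, reusing Exp and Var from the earlier core-language sketch (the names are our own).

```haskell
-- Environments pair each variable with its shape descriptor and, where
-- applicable, the nesting level of the generate that bound it.
type Env = [(Var, (NestedShape, Maybe Int))]

data Shape                        -- s
  = SExp   Env Exp                -- ⟨Σ_j; e⟩ : a shape-valued expression
  | Folded Shape                  -- drop the innermost dimension of s
  | Outer  NestedShape            -- outermost shape of a nested shape

data NestedShape                  -- ns
  = ScalarS                       -- 𝕊 : terminates the shape list
  | Cons   Shape NestedShape      -- s ▷ ns
  | TupleS [NestedShape]          -- tuples of nested shapes
  | ProjS  Int NestedShape        -- result of a projection
  | Inner  NestedShape Env Exp    -- index into a shape list, taking the Inner shape
  | Label  Int                    -- u_l : no static information available
```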

A shape s is either a shape-valued expression e, which may contain free variables bound in environment \(\varSigma _j\); \( Folded \ s\), which drops the innermost dimension of s; or \( Outer \ ns\), the outermost shape of the nested shape ns. Nested shapes ns are \(\mathbb {S}\)-terminated lists of s; tuples of nested shapes; the result of projections; or the result of indexing into a shape list with an index expression e, thus taking the Inner shape. Complex (irregular) shapes, or shapes for which we do not have any static information, are represented by a unique label \(u_{l}\).


The judgement \({\varSigma _j} \vdash {e} :_S {ns}\) denotes the derivation of shape descriptor ns for the expression e under environment \(\varSigma _j\), according to the rules in Listing 2. The environment maps every variable v to its shape descriptor (ns) and to the nesting level (i) of the generate whose function bound v. Variables not introduced via a generate function—that is, variables which are not a potential source of nested parallelism—have nesting level \(\emptyset \). The environment is annotated with an index j, denoting the number of generate calls we have entered. This index is used as the nesting level of the variable introduced by the next generate combinator, and we thus increment the index when entering its body.

Returning to our initial example, assuming environment \(\varSigma = [\mathtt {xss} : (u_0,\emptyset )]\) containing only the array xss, about which we have no static information, we can then deduce:

$$\begin{aligned}&{\varSigma _0} \vdash {\mathtt {xss}} :_S {u_0} \\&{\varSigma _0} \vdash {\mathtt {yss}} :_S { ({ \langle \varSigma _0; \texttt {extent~xss} \rangle } \vartriangleright { { \langle \varSigma _1, sh : (\mathbb {S},0); \texttt {extent~(xss!sh)} \rangle } \vartriangleright {\mathbb {S}} } } \end{aligned}$$

where we apply the rule for generate twice and subsequently the rule for scalar expressions.

Although the two shape descriptors for xss and yss are equivalent, they are not syntactically equal. We thus introduce shape equivalence, denoted by \(ns_1 \eqcirc ns_2\), which compares shape descriptors after partially evaluating them, for example by applying projections to tuples. Furthermore, we simplify certain patterns which we found to occur frequently. Note that other domain-specific simplification rules may also be possible. The following steps can always be applied to a single shape descriptor:

  S1.

    \(\langle \varSigma _j; \texttt {extent~e} \rangle \): we apply shape analysis on e, \({\varSigma _j} \vdash {e} :_S {ns}\), and take the outer shape as a result, \( Outer \ ns\). This is exactly the semantics of extent.

  S2.

    \({ Outer \ ns} \vartriangleright {\mathbb {S}}\): if ns is not nested, which can be determined from type information, it simplifies to ns.

  S3.

    \( Outer \ (s \vartriangleright ns)\) simplifies to s.

  S4.

    \( Outer \ ns_1 \vartriangleright Inner \ ns_2 \ \langle \varSigma _j; v\rangle \): this pattern arises from nested generates, as in our earlier example. If \(ns_1 \eqcirc ns_2\) and v’s nesting level matches the nesting depth of the shape, the shape descriptor simplifies to \(ns_1\). The nesting depth of the shape denotes how many \(\vartriangleright \) appear in front of the shape in the \(\vartriangleright \)-separated list. For instance, in \(s' \vartriangleright s \vartriangleright ns\), the whole shape has a nesting depth of 0, \(s \vartriangleright ns\) has depth 1, and ns has depth 2.

After simplification, when two shapes are compared, we check for syntactic equality. However, when comparing shape expressions, we have a few more equivalence rules:

  E1.

    \(\langle \varSigma ; e \rangle \): any variables inside e that were introduced by a generate only have to match by their nesting level, which is stored in the shape environment \(\varSigma _j\).

  E2.

    \(\langle \varSigma _j; e \rangle \): when we encounter an extent term as a subexpression of e, we apply shape analysis to its argument, as in rule S1.

Using these simplification rules, the shape descriptor of yss can be rewritten to be equal to the shape descriptor of xss in the following steps:

$$\begin{aligned}&= { Outer \ u_0 } \vartriangleright { { Outer \ ( Inner \ u_0 \ \langle \varSigma , sh : (\mathbb {S},0); sh\rangle ) } \vartriangleright {\mathbb {S}} }&\texttt {(S1,~S1)} \\&= { Outer \ u_0 } \vartriangleright { Inner \ u_0 \ \langle \varSigma , sh : (\mathbb {S},0); sh\rangle }&\texttt {(S2)} \\&= u_0&\texttt {(S4)} \end{aligned}$$

The shape analysis can be parameterised by which simplification rules to apply; which rules are worthwhile depends on the context in which the shape analysis is used. One additional rule which we do use is the inlining of all let-bound variables in expressions, for which we maintain another environment containing the definitions of all variables in scope.

4.3 Regularity Detection

Regularity detection identifies (sub-)expressions which are either constant or produce regular parallelism with respect to the surrounding parallel context. For example, if we map a function over an array in parallel, then the value of the function's argument depends on the parallel context, but any subexpression which does not mention that argument is constant with respect to that context.

Keller et al. [16] provide an algorithm to identify these subexpressions in the presence of arbitrarily nested contexts; however, that work does not take regularity information into account, and therefore misses important optimisation opportunities. Take, for example, a term which queries the extent of one subarray of a nested structure: if that structure is regular, the term returns the same result for all indices, so the term as a whole is constant even though it syntactically depends on the parallel context. In the remainder of the section we formalise our generalisation of the vectorisation avoidance [16] algorithm to take this information into account.

Formalisation. Listing 3a presents the grammar for the analysis, where regularity information is stored as a triple d, with i denoting whether the full term is totally independent (\(\top \)) or not (\(\bot \)); \(n\) recording for each nesting level whether it is regular (R) or irregular (Ir); and \(k\) tracking the nesting level of the variables introduced by generate (only generate can introduce nested computations which are dependent on the parallel context). Merging of regularity information is done via the operator \(\wedge \) given in Listing 3c. The analysis uses the results of shape analysis, and thus passes around a shape environment \(\varSigma _j\) in addition to the regularity environment \(\varGamma \), which maps variables to their regularity. The judgement \({\varGamma ; \varSigma _j} \vdash {e} :_R {d}\) denotes that expression e has regularity d under environments \(\varSigma _j\) and \(\varGamma \). We present the rules of regularity detection in Listings 4 and 5, where we use \({\varGamma ; \varSigma _j} \vdash {e} : {(d, ns)}\) to denote the combined results of both analyses:

$$\begin{aligned} {\varGamma ; \varSigma _j} \vdash {e} : {(d, ns)} \quad \text {iff} \quad {\varGamma ; \varSigma _j} \vdash {e} :_R {d} \ \text {and} \ {\varSigma _j} \vdash {e} :_S {ns} \qquad \mathrm {(1)} \end{aligned}$$

The rules for cond and while must check whether the shapes involved are fixed, independent of the parallel context, or whether we must assume that they may be irregular. The rule for extent checks whether the argument array is regular, in which case extent always returns the same value and is therefore independent. We have three rules for generate: the first checks whether the nesting level \(k\) of the function is greater than or equal to the current level; if so, the function is independent of any outer generate. The second checks whether the outer shape of the generate is independent, meaning the operation as a whole is regular; the third is the fallback case.


We conclude with a more interesting example: we modify our previous example from Sect. 4.2 to contain a conditional.

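The modified listing is not reproduced here. A hypothetical reconstruction consistent with the derivations below is the following, where the condition p and the branch bodies are placeholders of our own; what matters is that c is a scalar and that t and e are one-dimensional arrays, all depending on the outer index sh.

```
yss = generate (extent xss)
        (\sh ->
           let c = p sh                                             -- scalar condition
               t = generate (extent (xss ! sh)) (\i -> xss ! sh ! i + 1)
               e = generate (extent (xss ! sh)) (\i -> xss ! sh ! i - 1)
           in  cond c t e)
```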

The shape analysis can detect that yss has the same shape as xss, but it will also show that the two branches t and e have the same shape. Suppose we know that xss is a regular nested array. We show that regularity detection concludes that the sub-computations stay regular. Formally, we now have the environment \(\varGamma ; \varSigma = [\texttt {xss} : \langle {R} \vartriangleright { {R} \vartriangleright {\mathbb {S}}}, \top , \infty \rangle , \texttt {sh} : \langle \mathbb {S}, \bot , 0 \rangle ] ;[ \texttt {xss} : (u_0, \emptyset ), \texttt {sh} : (\mathbb {S}, 0)] \), where we have added the variable sh introduced by the outer generate. Let us inspect the results for c, t, and e, which we need for the conditional.

$$\begin{aligned} {\varGamma ; \varSigma _1}&\vdash {\texttt {c}}:_R{ \langle \mathbb {S}, \bot , 0 \rangle }&&\\ {\varGamma , c : ...; \varSigma _1, c : ...}&\vdash {\texttt {t}}:_R{ \langle {R} \vartriangleright {\mathbb {S}}, \bot , 0 \rangle }&&\\ {\varGamma , c : ..., t : ...; \varSigma _1, c : ..., t : ...}&\vdash {\texttt {e}}:_R{ \langle {R} \vartriangleright {\mathbb {S}}, \bot , 0 \rangle }&&\end{aligned}$$

The results for c and t follow from a simple combination of the rules; they can be viewed as a scalar and a regular array, respectively. Both are also dependent (\(\bot \)) on the outer generate (nesting level 0). The result for e is the same as that for t, although different rules are used to derive it.

With the above results and the fact that t and e have the same shape, we can apply the rule for cond to get:

$$\begin{aligned} {\varGamma , ...; \varSigma _1, ...}&\vdash {\texttt {cond~c~t~e}}:_R{ \langle {R} \vartriangleright {\mathbb {S}}, \bot , 0 \rangle }&&\end{aligned}$$

Thus the body of the outer generate has sub-computations that are dependent, but regular. Finally, applying the rule for generate to the outer generate gives us that yss is a nested regular array that is independent of any other parallel context.

5 Evaluation

The objective of this work is to extend the data-parallel programming model to efficiently execute array-level conditionals and iterations over irregularly nested structures. In this section we evaluate the effectiveness of our work through a number of benchmarks. Our benchmarks are conducted using a GeForce RTX 2080 Ti (compute capability 7.0, 68 multiprocessors = 4352 cores at 1.65 GHz, 11 GB GDDR6) backed by a 16-core Threadripper 2950X (1.9 GHz, 64 GB RAM, hyperthreading is enabled) running GNU/Linux (Ubuntu 19.10). We used GHC-8.6.3, LLVM-9, and CUDA-10.1.

Our implementation in the deeply embedded language Accelerate means that the analyses presented here, as well as all other compiler stages such as optimisation and code generation, occur during the runtime of the host language program. In order to focus on the effectiveness of the optimisations presented in this paper, which are generally applicable and not related to our specific implementation, we report total kernel execution time on the GPU including memory transfer overhead rather than overall application runtime.

Fig. 3. Weak scaling of benchmark programs. The results of this work are shown in purple (Accelerate, Regular), compared to unoptimised Accelerate programs in green (Accelerate). (Color figure online)

5.1 Quicksort

To evaluate the overhead of irregularity we use a loop-based implementation of the Quicksort [15] algorithm, which is representative of irregular divide-and-conquer algorithms that require both intra- and inter-routine parallelism to achieve optimal parallel work complexity. This benchmark is chosen because it has minimal computation: the runtime of the algorithm is entirely dominated by data movement, so the cost of managing segment descriptors for irregular arrays cannot be hidden.

The benchmark sorts each row of an \(n \times m\) matrix in parallel, where each row of the matrix iterates a different number of times over its subarray. Our analysis detects that the iteration leaves the shapes of the subarrays unchanged, so the computation can be optimised as regular nested parallelism. The results in Fig. 3 show that our optimised version is 6 to 13 times faster than the unoptimised Accelerate program.

Futhark uses a different method to support nested parallelism  [14]. For small arrays the GPU is not fully utilised, but for larger arrays their approach has lower overhead and overtakes our implementation. The incremental flattening approach of Futhark is orthogonal to this work, so it would be possible to utilise both approaches simultaneously.

5.2 Fast Fourier Transform (FFT)

We benchmark three versions of the Split-Radix FFT algorithm in Accelerate: normal, where we do not exploit the nested parallelism; regular, with nested parallelism and our optimisations switched on; and irregular, a nested parallel implementation without the optimisations. The Split-Radix FFT algorithm consists of an outer while loop which operates over successively smaller arrays. Futhark does not support irregular nested parallelism and its analysis cannot detect that the computation is regular, so its compiler is unable to compile this program. We instead benchmark the Stockham algorithm in Futhark, which it is able to compile.

Figure 3 shows the results of executing a number of \(32\times 32\) Fourier transforms. As a baseline we also compare against the highly optimised cuFFT library, which we call via Accelerate’s foreign function interface [5]. The unoptimised irregular code is more than an order of magnitude slower than our optimised regular implementation. The program normal is unable to execute multiple FFTs in parallel and thus performs poorly at large array sizes—even compared to the unoptimised irregular implementation—as it does not expose enough parallelism to properly utilise the GPU.

6 Related Work

Languages like NESL  [2] and Data Parallel Haskell  [19] support fully irregular nested data parallelism, but they struggle to achieve good performance. Nessie  [21] is ongoing work on a NESL compiler which targets GPUs. Manticore  [9] also supports irregular nested data parallel computations on CPU multicores by flattening the data structures  [1], but not the parallel computations.

To control excessive parallelism arising from regular nested parallelism, incremental flattening [14] executes the inner parallelism of a nested computation sequentially in some circumstances, which allows for better use of shared memory on GPUs. More recently, Futhark added support for a certain kind of irregular nested parallelism [7], but this has not been integrated into the backend yet [8]. Futhark also performs shape analysis [12] to symbolically determine the exact shape of arrays where possible, switching to dynamic handling where necessary. Our analysis aims to compare shapes, not to determine them exactly, so it can be done completely statically. We can also capture some irregular array structures, whereas Futhark's shape analysis only works with regular structures.

Other data-parallel languages, like Halide [20], Obsidian [24], Lift [23] and SaC [10], all aim at producing high-performance code for CPUs and/or GPUs. We believe they could benefit from the work presented here, either in implementing irregular nested data parallelism or in admitting more programs which expose regular subcomputations.

7 Conclusion

We presented two analyses for irregular nested parallel array languages, and demonstrated how these analyses can be used to identify and specialise code for regular sub-computations within nested irregular computations. We extended the Accelerate language with two constructs that enable expressing a limited form of irregular nested parallelism, together with our regularity optimisations, and provided benchmarks demonstrating the effect of these optimisations. Our work is open source and available at https://github.com/sakehl/accelerate/tree/feature/sequences.