
The Droplet Search Algorithm for Kernel Scheduling

Published: 21 May 2024

Abstract

Kernel scheduling is the problem of finding the most efficient implementation for a computational kernel. Identifying this implementation involves experimenting with the parameters of compiler optimizations, such as the size of tiling windows and unrolling factors. This article shows that it is possible to organize these parameters as points in a coordinate space. The function that maps these points to the running time of kernels, in general, will not determine a convex surface. However, this article provides empirical evidence that the origin of this surface (an unoptimized kernel) and its global optimum (the fastest kernel) reside on a convex region. We call this hypothesis the “droplet expectation.” Consequently, a search method based on the Coordinate Descent algorithm tends to find the optimal kernel configuration quickly if the hypothesis holds. This approach—called Droplet Search—has been available in Apache TVM since April of 2023. Experimental results with six large deep learning models on various computing devices (ARM, Intel, AMD, and NVIDIA) indicate that Droplet Search is not only as effective as other AutoTVM search techniques but also 2 to 10 times faster. Moreover, models generated by Droplet Search are competitive with those produced by TVM’s AutoScheduler (Ansor), despite the latter using 4 to 5 times more code transformations than AutoTVM.

1 Introduction

In the context of this article, a kernel is a function that reads and writes data indexed by a linear combination of natural numbers. Kernels are typically implemented as nests of affine loops. Examples of kernels include matrix multiplication, transposition, and convolutions. Following Jin et al. [2022], we say that a deep learning model is a function implemented as alternating layers of kernels and non-linear functions (sigmoid, rectified linear unit (ReLU), etc.). Examples of deep-learning models include neural networks, such as BERT [Devlin et al. 2019], ResNet-18 [He et al. 2015], VGG-16 [Sengupta et al. 2018], MobileNet [Howard et al. 2017], and MXNet [Chen et al. 2015].
The Space of Kernel Schedulings. A kernel is an abstract concept that supports different concrete implementations [Jin et al. 2022]. Each implementation, in this article, is called a kernel schedule. The set of every schedule of a kernel is called its search space. The choice of implementation impacts the performance of the kernel. Finding the best schedule for a kernel is an optimization problem whose objective function is running time: The faster a kernel runs, the better that schedule is. The problem of finding exact analytical solutions to kernel scheduling is open, even when fixing the computer architecture [Tollenaere et al. 2023]. Hence, typical techniques are stochastic, take time to converge, and provide no guarantees of optimality [Lebedev and Belecky 2021].
Coordinates and Neighborhoods. Kernel schedules differ due to the application of code transformations, such as tiling, unrolling, and thread blocking. If we fix the sequence of transformations that define the search space, then each schedule is uniquely determined by the transformation parameters: unrolling factor and tiling window per loop, number of threads per block, and so on. These parameters admit a total order: If \(m \lt n\), with \(m, n \in \mathbb {N}\), then an unrolling factor of n is larger than an unrolling factor of m. This ordering determines coordinates on the space of kernel schedules, hence yielding a notion of neighborhood. The neighborhood function relates kernel configurations produced by transformation vectors that differ minimally in a single parameter.
The Key Observation and the Implied Hypothesis. In a convex optimization space, every local minimum is a global minimum. The space of kernel schedules is usually not convex, as Example 3.5 will show. However, we believe that the following expectation applies to the vast majority of machine learning models: It is possible to project the set of all kernel schedules onto a system of coordinates, such that the region between the origin of this space and the fastest kernel configuration forms a convex hypersurface with respect to the running time of the kernels. Hence, the optimum configuration can be reached from the origin by descending along the running time function through a contiguous chain of neighboring kernel configurations. We call this observation the Droplet Expectation and formalize it in Section 3 (see Definition 3.6).
Based on this expectation, this article describes a scheduling technique called Droplet Search, which is currently part of Apache TVM.1 This article evaluates Droplet Search on six architectures (two x86 CPUs, two ARM CPUs, and two NVIDIA GPUs) and on six models (BERT, ResNet-18, VGG-16, MobileNet, MXNet, and Inception-v3). Section 4 shows that Droplet Search runs up to 10 times faster than the other four search algorithms in AutoTVM [Chen et al. 2018] and the search algorithm in TVM’s Ansor [Zheng et al. 2020]. The kernels produced via Droplet Search tend to outperform those produced by the other search approaches in AutoTVM and approximate those produced by Ansor, even though the latter might use up to four times more transformation parameters. These results, explained in Section 4, come from the following contributions:
Intuition: The droplet expectation is not a theorem. In Section 4.4, we show that it is possible to disprove it using analytical cost models involving discontinuous functions. However, the hypothesis is expected to hold in cost models described by continuous functions involving only positive domains and coefficients: a property that Renganarayana and Rajopadhye [2008] call “Positivity.” These models are rather common: Table II in Renganarayana and Rajopadhye’s paper lists eight of them from previous work. More recent models [Olivry et al. 2021, 2020] share similar properties, as Section 4.4 discusses.
Simplicity: The implementation of Droplet Search in AutoTVM 0.14.0 (the pseudocode in Figure 4) consists of 127 lines of commented Python code. For comparison, the implementation of TVM XGB’s search [Chen et al. 2018] uses 971 lines in three files (xgboost_tuner.py, xgboost_cost_model.py, and sa_model_optimizer.py).
Adaptability: Droplet Search works in any setting where AutoTVM does, including settings like a Cortex A7, where Ansor cannot be used to optimize MobileNet [Howard et al. 2017].
Efficiency: Droplet Search converges more than twice as fast as the different algorithms available in AutoTVM (random sampling, grid search, genetic search, and XGBoost) and usually is more than four times faster than TVM’s AutoScheduler.
Effectiveness: In almost every experiment we ran, Droplet Search delivered kernels that either match or outperform those produced by the other schedulers in AutoTVM. Compared to Ansor, in a universe of 30 experiments, Droplet Search found faster kernels in 11 settings and lost in 12. However, Ansor uses up to four times more transformations.

2 The Search Space

A kernel admits an abstract view, formed by an iteration space, a data space, and a computation constrained into these zones.2 As mentioned in Section 1, this view can be implemented in multiple ways as long as the data dependencies encoded in the kernel’s computation are respected. Each implementation differs in how the iteration space is traversed. This scheduling determines how the kernel’s computation updates the data space. Example 2.1 clarifies these notions.
Example 2.1.
Figure 1 shows an abstract view of the matrix multiplication kernel. The computations performed by the kernel can be indexed by triples \((i, j, k)\) , which form its iteration space. The bounds \(R_A\) , \(C_B\) , and \(C_A\) that delimit this space abide by two constraints. The first, \(C_A = R_B\) , is mandatory for correctness; the second— \(R_A\) is even—we use for the sake of the example. Figure 1 also shows five implementations of the abstract kernel. These implementations produce the same matrix C; however, the order in which the computations occur—the kernel schedule—might vary.
Fig. 1.
Fig. 1. (a) The abstract representation of the matrix multiplication kernel. ((b) and (c)) Canonical schedules of the kernel. ((d)–(f)) Schedules that result from applying loop tiling and loop unrolling with different parameters onto the canonical schedule in (c).
Transformation Vectors. As hinted in Example 2.1, in the context of this article, implementations of a kernel differ in the code transformations applied to them. These transformations are guided by parameters. Example 2.2 shows parameters for two well-known transformations: tiling and unrolling.
Example 2.2.
Figure 1(e) shows the kernel that comes from Figure 1(c) after the application of three instances of tiling: a transformation that partitions the iteration space into smaller regions. In this example, tiling happens along the three axes of the iteration space. The dimensions of the tiling window are 8, 32, and 16 points. Each one of these sizes is an optimization parameter. Figure 1(f) shows the kernel produced after an application of unrolling onto the innermost loop of the kernel in Figure 1(c). The unrolling factor, i.e., the transformation parameter, is 2.
The parameters of code transformations can be organized into transformation vectors. A transformation vector is a tuple whose elements represent these parameters. The order in which these parameters appear in the vector determines the order in which transformations are applied to programs. Thus, if a given code transformation has multiple parameters, then these parameters exist sequentially in the transformation vector. Example 2.3 shows how the parameters seen in Example 2.2 can be represented as transformation vectors.
Example 2.3.
Consider the ordered application of the two loop-related compiler optimizations mentioned in Example 2.2 onto the abstract kernel in Figure 1(a): tiling along dimensions A, B, and C and unrolling on dimensions A, B, and C. The vectors representing any application of these code transformations have the format: \(\langle \mathtt {tile}_A:t_A, \mathtt {tile}_B:t_B, \mathtt {tile}_C:t_C, \mathtt {roll}_A:r_A, \mathtt {roll}_B:r_B, \mathtt {roll}_C:r_C \rangle\) . Figure 2 shows the vectors that produce the kernels in Figure 1.
Fig. 2.
Fig. 2. The transformation vectors that produce kernels in Figure 1.
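To make this representation concrete, the sketch below shows one way the transformation vectors of Figure 2 could be encoded. The dictionary-based encoding and the helper make_vector are illustrative assumptions for this article's examples, not the data structures used inside AutoTVM.

```python
# A minimal sketch (not AutoTVM's internal representation) of the
# transformation vectors of Example 2.3: three tiling parameters
# followed by three unrolling parameters. A value of 0 stands for
# "transformation not applied".
from collections import OrderedDict

def make_vector(t_a=0, t_b=0, t_c=0, r_a=0, r_b=0, r_c=0):
    """Build a transformation vector as an ordered map from parameter
    name to parameter value."""
    return OrderedDict([
        ("tile_A", t_a), ("tile_B", t_b), ("tile_C", t_c),
        ("roll_A", r_a), ("roll_B", r_b), ("roll_C", r_c),
    ])

# The null vector yields the canonical schedule of Figure 1(c); the
# second vector describes a schedule tiled with windows of 8, 32, and
# 16 points, like the one in Figure 1(e).
null_vector = make_vector()
tiled_vector = make_vector(t_a=8, t_b=32, t_c=16)
```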
The iteration space of an abstract kernel can be traversed in many ways. If we ascribe one loop per dimension of the iteration space, then any ordering of these loops that respects data dependencies is a valid traversal. The implementation of any such ordering, subject to a null transformation vector (the vector representing the absence of transformations), is called a canonical schedule. A kernel may have more than one canonical schedule, as Example 2.4 illustrates. Canonical schedules lead to Kernel Transformation Spaces, which Definition 2.5 formalizes.
Example 2.4.
Figure 1(b) and (c) show canonical schedules for the abstract kernel in Figure 1(a). Any permutation of the loops in Lines 03–05 yields a valid canonical schedule. Canonical schedules, by definition, are not subject to code transformations: They represent the application of a null transformation vector. As a consequence, they do not show the effects of typical compiler optimizations, such as loop invariant code motion. Thus, the invariant code in Line 06 of Figure 1(b) remains inside the innermost loop.
Definition 2.5 (Kernel Transformation Space).
Let \(\langle \mathcal {P}_1:p_1, \mathcal {P}_2:p_2, \ldots , \mathcal {P}_n:p_n \rangle\) be a transformation vector, such that each parameter \(p_i\) comes from a range of parameters \(\mathcal {P}_i\) . The Cartesian product \(\mathcal {P}_1 \times \mathcal {P}_2 \times \ldots \times \mathcal {P}_n\) plus a canonical kernel \(\mathcal {K}\) form a kernel transformation space. Each point of this space represents a configuration of \(\mathcal {K}\) transformed by an instance of that Cartesian product. \(\mathcal {K}\) is called the origin of this space and the set of transformation parameters is called the basis of this space. If the basis contains n parameters, then the space is called n-dimensional.
Example 2.6.
The vectors in Example 2.3, plus the canonical kernel in Figure 1(c), form a six-dimensional transformation space. The origin of this space is the kernel in Figure 1(c), and its basis is formed by three parameters of loop tiling and three parameters of loop unrolling.
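As a complement to Definition 2.5, the following sketch enumerates a small transformation space as a Cartesian product of parameter ranges. The ranges are assumptions chosen for illustration; real templates in AutoTVM define their own ranges per kernel.

```python
# Enumerating a small, illustrative kernel transformation space as the
# Cartesian product of parameter ranges (Definition 2.5). The ranges
# below are assumptions made for the example, not AutoTVM defaults.
import itertools

parameter_ranges = {
    "tile_A": [0, 8, 16, 32],
    "tile_B": [0, 8, 16, 32],
    "tile_C": [0, 8, 16, 32],
    "roll_A": [0, 2, 4],
    "roll_B": [0, 2, 4],
    "roll_C": [0, 2, 4],
}

names = list(parameter_ranges)
space = [dict(zip(names, values))
         for values in itertools.product(*parameter_ranges.values())]

print(len(space))   # 4 * 4 * 4 * 3 * 3 * 3 = 1,728 configurations
print(space[0])     # the origin: every parameter at its null value
```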
Not every sequence of code transformations is valid. For instance, unrolling a loop by a factor larger than that loop’s trip count is not meaningful. Thus, Definition 2.5 implicitly assumes that the transformation space is produced by valid transformation vectors. Furthermore, the set of kernels determined by Definition 2.5 is non-exhaustive, even under bounded parameters. Non-exhaustiveness follows from two simplifications that we enumerate below, which are assumed in the construction of the transformation space. These simplifications are also adopted in AutoTVM [Chen et al. 2018] and have been incorporated into further work inspired by it [Zhang et al. 2021] as follows:
(1)
Code transformations are applied to the canonical kernel always in a fixed order. In other words, it is possible that by varying the order in which transformations are applied, the space could be extended with new concrete kernels.
(2)
Every point in the transformation space comes directly from the canonical kernel via the application of one transformation vector. Notice that successive applications of optimizations could, in principle, produce kernels outside the transformation space.
The performance of the kernels that form the transformation space might vary. These variations depend on the scheduling, on the target architecture, and on the dimensions of the iteration and data spaces. The problem of finding the best implementation of a kernel, given these constraints, is known as the Kernel Scheduling Problem, a notion extensively discussed in previous work (see Section 5 for references). To keep this article self-contained, Definition 2.7 restates this concept. Search strategies for Kernel Scheduling are key to optimizing deep learning models; thus, the problem in Definition 2.7 has been extensively researched. Example 2.8 revisits some of these techniques.
Definition 2.7 (Kernel Scheduling Problem).
Given a computing device, a transformation space with origin \(\mathcal {K}\) and basis ( \(\mathcal {P}_1, \ldots , \mathcal {P}_n\) ), plus valid inputs for \(\mathcal {K}\) , kernel scheduling asks for the fastest implementation of \(\mathcal {K}\) in the transformation space.
Example 2.8.
Figure 3 provides a pictographic metaphor for different search techniques for the kernel scheduling problem implemented in AutoTVM and contrasts them with the Droplet Search method that this article introduces. Section 3 shall describe the droplet algorithm. Regarding the other methodologies, the search proceeds as follows:
Grid: This explores a bounded region of the transformation space. Regular ranges of transformation parameters determine this region.
Random: Points of the transformation space are sampled randomly. Sampling usually follows a uniform distribution on predefined bounds placed onto the parameters.
Annealing: Simulated annealing [Kirkpatrick et al. 1988] is a refinement of random search, where sampling alternates between regions that are close to and distant from the current best points. Points within a neighborhood are sampled, and, from time to time, the center of this neighborhood changes. The probability of such large jumps happening decreases with time.
Fig. 3.
Fig. 3. Pictographic metaphors for different search algorithms. ((a) and (b)) Grid search and Random sampling: Every configuration within the gray area—and only those—will be evaluated. (c) Simulated Annealing: Solid lines represent moves that go “downhill,” i.e., toward faster kernel configurations; dashed edges represent moves that go “uphill,” i.e., which accept to explore slower configurations to escape from local minima. (d) Droplet Search: The darker the color of a configuration, the faster the running time of that configuration.

3 Droplet Search

Droplet Search is a greedy heuristic that explores the kernel transformation space. The proposed technique is a variation of the Coordinate Descent Algorithm,3 with extensions proposed by Richtárik and Takác [2012] to enable synchronous parallelism. Figure 4 provides an overview of the search algorithm evaluated in this article, and Figure 3(d) provides a graphical metaphor to illustrate how the algorithm works: At each step—up to a maximum fixed number of iterations—the search chooses the most profitable kernel configuration that is considered a “neighbor” of the current best configuration. The rest of this section discusses each part of the algorithm shown in Figure 4.
Fig. 4.
Fig. 4. Droplet Search, a Coordinate Descent variation proposed in this article to solve the Scheduling Problem from Definition 2.7.
The successful application of Coordinate Descent depends on a suitable “neighborhood function,” the subject of Section 3.1. The search tends to converge to a global optimum, based on the intuition that Section 3.2 provides. Convergence is based on criteria discussed in Section 3.3. The section closes with a discussion of two optimizations that speed up the search: parallelization and speculation (Section 3.4).
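For readers who prefer code to pseudocode, the sketch below condenses the search loop of Figure 4 into a few lines. It is not the 127-line AutoTVM implementation: neighborhood and measure are placeholders for the neighborhood function of Section 3.1 and for the timed execution of a kernel configuration, and the statistical test of Section 3.3 as well as the parallel, speculative evaluation of Section 3.4 are omitted.

```python
# A condensed sketch of the Droplet Search loop of Figure 4 (not the
# actual AutoTVM code). `neighborhood(v)` enumerates neighbors as in
# Definition 3.2; `measure(v)` compiles and times configuration v.
def droplet_search(origin, neighborhood, measure, max_iters=10_000):
    best, best_time = origin, measure(origin)
    visited = {tuple(origin.items())}
    for _ in range(max_iters):
        candidates = [v for v in neighborhood(best)
                      if tuple(v.items()) not in visited]
        if not candidates:
            break
        # In the real implementation these measurements run in parallel
        # (Section 3.4) and comparisons use a t-test (Section 3.3).
        timed = [(measure(v), v) for v in candidates]
        visited.update(tuple(v.items()) for v in candidates)
        time, vector = min(timed, key=lambda pair: pair[0])
        if time >= best_time:
            break            # no neighbor is faster: a local minimum
        best, best_time = vector, time
    return best, best_time
```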

3.1 The Neighborhood Function

Any code transformation used in this article is parameterized by one positive integer. Example 2.2 describes some of these parameters. This assumption—transformations defined by one positive integer—forces a total ordering over different instances of a code transformation. The domain of a code transformation is the set of all the values of its parameters. This article works only with numeric domains; however, a domain does not need to be contiguous, as Example 3.1 shows.
Example 3.1.
The list below shows different compiler optimizations and the ordering between their instances. The example takes liberties for the sake of the illustration; i.e., vector lengths are architecture dependent, and not every size might be available, as follows:
Peeling, with parameter \(\mathtt {peel}=n, n \in [0, 1, 2, \dots , \mathtt {maxp}]\) being the number of iterations to peel. The set of values \(\lbrace 0, 1, \ldots , \mathtt {maxp}\rbrace\) is the domain of the transformation.
Unrolling, with parameter \(\mathtt {roll}=n, n \in [0, 1, 2, \dots , \mathtt {maxr}]\) being the unrolling factor.
Thread blocking, (in graphics-processing units) with parameter \(\mathtt {thread}=n, n \in [0, 1, 2, \dots ,\) \(\mathtt {maxt}]\) being the number of GPU threads per block.
Tiling, with parameter \(\mathtt {tile}=n, n \in [N/20, N/15, N/12, N/10, N/6, N/4, N/3, N/2]\) , where N is the size of the iteration space. We assume that N is a multiple of 60. Notice that this example chooses perfect divisors of the iteration space, but different ranges are possible.
Vectorization, with parameter \(\mathtt {vect}=n, n \in [0, 2, 4, 8, 16]\) being the vector length.
In any of the above examples, we say that an instance of a transformation is less than another instance based on the ordering between their parameters. For example, if we consider vectorization ( \(\mathtt {vect}\) ), then \(\mathtt {vect}=m \lt \mathtt {vect}=n\) if, and only if, \(m \lt n\) .
The order between parameters leads to a notion of neighborhood, formalized as follows:
Definition 3.2 (Neighborhood).
Consider two transformation vectors of an n-dimensional space: \(v_1 = \langle \mathcal {P}_1:p_1, \ldots , \mathcal {P}_{i-1}:p_{i-1}, \mathcal {P}_i:x, \mathcal {P}_{i+1}:p_{i+1}, \ldots , \mathcal {P}_n:p_n \rangle\) ; and \(v_2 = \langle \mathcal {P}_1:p_1, \ldots , \mathcal {P}_{i-1}:p_{i-1}, \mathcal {P}_i:y, \mathcal {P}_{i+1}:p_{i+1}, \ldots , \mathcal {P}_n:p_n \rangle\) , such that each p is a transformation parameter within range \(\mathcal {P}\) . Vectors \(v_1\) and \(v_2\) are neighbors along dimension \(i, 1 \le i \le n\) , if they differ only on dimension i and if
(1)
\(x \lt y\) (without loss of generality);
(2)
\(\forall z \in \mathcal {P}_i, z \ne x\) , if \(z \lt y\) , then \(z \lt x\) .
(3)
\(\forall z \in \mathcal {P}_i, z \ne y\) , if \(z \gt x\) , then \(z \gt y\) .
If two vectors are neighbors along one dimension, then they are called neighbors. The set of every neighbor of a given transformation vector is called the neighborhood of that vector.
Each dimension of a transformation vector v is considered separately when determining the neighbors of v. There exist at most two neighbors per transformation parameter; therefore, the number of neighbors of v grows linearly with the number of its dimensions. This observation is essential for scalability: If the neighborhood of a vector considered variations in two or more dimensions, then the size of the neighborhood would be exponential on the number of dimensions.
Example 3.3.
Consider the following transformation vector: \(v = \langle \mathtt {tile}_A: 0, \mathtt {tile}_B:8, \mathtt {tile}_C:16, \mathtt {roll}_A:4, \mathtt {roll}_B:0, \mathtt {roll}_C:0\rangle\) , which applies onto the canonical kernel in Figure 1(c). If we let \(\mathtt {roll}_A = [2, 4, 8, 16, 24]\) , then v has two neighbors along dimension \(\mathtt {roll}_A\) . These neighbors are \(v_1 = \langle \ldots , \mathtt {roll}_A:2, \ldots \rangle\) and \(v_2 = \langle \ldots , \mathtt {roll}_A:8, \ldots \rangle\) .
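The neighborhood of Definition 3.2 admits a direct implementation once each parameter domain is kept as a sorted list. The sketch below reproduces Example 3.3 under the dictionary-based vector encoding assumed in the earlier sketches.

```python
# A minimal sketch of the neighborhood of Definition 3.2: a neighbor
# changes a single parameter to the value immediately below or above
# the current one within that parameter's (sorted) domain.
def neighborhood(vector, domains):
    neighbors = []
    for name, value in vector.items():
        domain = domains[name]
        i = domain.index(value)
        for j in (i - 1, i + 1):   # at most two neighbors per dimension
            if 0 <= j < len(domain):
                neighbor = dict(vector)
                neighbor[name] = domain[j]
                neighbors.append(neighbor)
    return neighbors

# Example 3.3, restricted to the roll_A dimension:
domains = {"roll_A": [2, 4, 8, 16, 24]}
print(neighborhood({"roll_A": 4}, domains))
# [{'roll_A': 2}, {'roll_A': 8}]
```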

3.2 Convexity

The goal of this article is to find implementations for abstract kernels that minimize their running times. To this purpose, we define the running time function \(\mathit {RT}\) as follows:
Definition 3.4 (The Running Time Function).
Given (i) a canonical Kernel \(\mathcal {K}\) , (ii) a transformation vector v with parameters \(\mathcal {P}\) , (iii) a computer architecture A, and (iv) input data I for the canonical kernel, we define the running time function \(\mathit {RT}_{A,I,\mathcal {K}}(v) : \mathcal {P} \mapsto \mathbb {R}\) as a function that maps the implementation of \(\mathcal {K}\) optimized by v to the time it takes to process input I on target A.
In practice, \(\mathit {RT}\) is obtained by running the kernel configurations. The notion of neighborhood makes the transformation space a coordinate space: It is possible to define the distance between two transformation vectors. The running time function \(\mathit {RT}\) divides this coordinate space into two halves: If \(\mathit {RT}_{A,I,\mathcal {K}}(v) = t\) , then we have a region formed by \(t_{\mathit {lower}} \lt t\) and a region formed by \(t_{\mathit {higher}} \ge t\) . Thus, \(\mathit {RT}_{A,I,\mathcal {K}}\) determines a hypersurface over the transformation space. This hypersurface is convex if its local minima are equal to its global minimum. We say that vector v is a local minimum of \(\mathit {RT}_{A,I,\mathcal {K}}\) if \(\mathit {RT}_{A,I,\mathcal {K}}(v) \lt \mathit {RT}_{A,I,\mathcal {K}}(v^{\prime })\) for any \(v^{\prime }\) within the neighborhood of v. A global minimum is the smallest local minimum in a set.
Example 3.5.
Figure 5 shows hypersurfaces produced by two running time functions. Each function is parameterized by a different architecture and different inputs (the dimensions of the iteration space). The functions are parameterized by the same canonical kernel—the configuration in Figure 1(b). Two parameters form the transformation space: \(\mathtt {tile}_A \in [0, 20, 40, \ldots , 120]\) , and \(\mathtt {tile}_B \in [0, 20, 40, \ldots , 120]\) . The figure shows, for each running time function, its optimal configuration. The figure also shows at least one local minimum that differs from the optimum. These hypersurfaces are not convex, because they contain local minima that are not globally optimal.
Fig. 5.
Fig. 5. Performance variation for the kernel in Figure 1(b), considering the following parameters of the iteration space: (left) \(A = 1,000, B = 800, C = 700\) and (right) \(A = 3,500, B = 1,400, C = 2,400\) . White pins show optimal tiling configurations, and gray pins show points of local minima that are not globally optimal.
The hypersurfaces in Figure 5 are not convex. However, they have the following property: The origin and the global optimum belong to the same convex region. This property remains true, at least for the two settings in Figure 5, if we add more dimensions to the transformation space considered in Example 3.5, such as an extra tiling window, unrolling of the innermost loop, or interchange of any pair of loops. Definition 3.6 states this observation as our working hypothesis.
Definition 3.6 (The Droplet Expectation).
Let \(v_{\mathit {opt}}\) be the optimum kernel of the running time function \(\mathit {RT}_{A,I,\mathcal {K}}\) . We expect \(v_{\mathit {opt}}\) and \(\mathcal {K}\) , the unoptimized kernel, to belong to a convex, contiguous subset of the hypersurface determined by \(\mathit {RT}_{A,I,\mathcal {K}}\) . Hence, we expect the existence of a chain of neighbor vectors \(v_0, v_1, v_2, \ldots , v_{n-1}, v_n\) , where \(\mathcal {K} = v_0\) and \(v_n = v_{\mathit {opt}}\) , with the following properties:
Contiguous chain: \(v_i \in \mathit {neighborhood}(v_{i-1}), 0 \lt i \le n\) ; and
Descending chain: \(\mathit {RT}(v_i) \lt \mathit {RT}(v_{i-1}), 0 \lt i \le n\) .
A continuous hypersurface can be partitioned into convex regions: maximal convex sets formed by the transitive closure of the neighborhood function. Whenever the Droplet Expectation is confirmed, the origin and the global optimum belong to the same convex region. Thus, there exists a “downhill path” from origin to optimum that can be found by Coordinate Descent (assuming an idealized running time function without statistical variations). If the Droplet Expectation fails, then the origin and the global optimum belong to distinct convex regions. Figure 6 illustrates these ideas.
Fig. 6.
Fig. 6. (a) Downhill path from origin to global optimum on a three-dimensional hypersurface. (b) Expectation holds: Origin and global optimum belong to the same convex region. (c) Expectation fails: Origin and global optimum belong to different convex regions. Configuration \(v_3\) is a local minimum but not a global optimum.
Intuition. The hypothesis stated in Definition 3.6 is not a certainty: It is possible to disprove it with analytical models involving discontinuous functions, as Section 4.4 shows. However, Section 4 demonstrates that the hypothesis holds on a variety of architectures and models. Indeed, the effect of many compiler optimizations can be described by polynomial equations involving only positive coefficients ranging over positive domains. The second derivative of such functions, if it exists, will always be positive. This condition is sufficient to ensure convexity [Bertsekas 2009, Prop. 1.1.10]. In the words of Renganarayana and Rajopadhye [2008], “the use of polynomial functions with this property leads to convex optimization problems which can be solved for real solutions in polynomial time.” Renganarayana and Rajopadhye call this property Positivity. Notice that the ordering of the various optimization levels along each dimension of the search space is primarily responsible for making the droplet expectation hold. For instance, going back to Example 3.5, the expansion of the tiling window might be beneficial for performance until the number of elements in this window exceeds the capacity of the L1 cache. After this point, any further increase will cause cache misses.
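The toy model below illustrates this intuition numerically. The cost function is invented for the example (it is not one of the models cited above), but it has the Positivity property: positive coefficients over a positive domain, with reload traffic decreasing and per-tile overhead increasing as the tile grows. On such a surface, coordinate descent starting from the smallest tile reaches the same configuration as an exhaustive sweep.

```python
# A toy cost model with the Positivity property (invented for this
# illustration): reload traffic shrinks as the tile grows, while the
# per-tile overhead grows with the tile's perimeter.
def cost(h, w):
    return 262144.0 / (h * w) + h + w

domain = list(range(8, 257, 8))          # candidate tile sizes 8, 16, ..., 256

def descend(h, w):
    """Coordinate descent over the (h, w) grid, starting at (h, w)."""
    best = (h, w)
    while True:
        i, j = domain.index(best[0]), domain.index(best[1])
        neighbors = [(domain[a], domain[b])
                     for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                     if 0 <= a < len(domain) and 0 <= b < len(domain)]
        nxt = min(neighbors, key=lambda p: cost(*p))
        if cost(*nxt) >= cost(*best):
            return best
        best = nxt

print(descend(8, 8))                      # descent from the "origin": (64, 64)
print(min(((h, w) for h in domain for w in domain),
          key=lambda p: cost(*p)))        # exhaustive optimum:        (64, 64)
```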

3.3 Stop Criterion

The existence of the convex path expected in Definition 3.6 does not ensure that such a path can be discovered via a Coordinate Descent search procedure. Running time has a stochastic nature: This nature implies that the actual evaluation of \(\mathit {RT}_{A,I,\mathcal {K}}\) on a kernel produced by a transformation vector v is prone to variations. Thus, Coordinate Descent might reach suboptimal configurations that are apparently optimal due to measurement fluctuations on \(\mathit {RT}_{A,I,\mathcal {K}}(v)\) .
To increase the reliability of Coordinate Descent, multiple evaluations of each point of the transformation space are in order. However, each further evaluation contributes to increasing the total time necessary to solve scheduling. The implementation of Droplet Search that we analyze in Section 4 performs three evaluations of each point visited in the process of solving scheduling. Three evaluations are the default sampling procedure adopted by AutoTVM. We use Student’s t-test (following Levine [1969]’s implementation) with significance level \(\alpha = 0.05\) to compare the two populations of three samples each.4 In other words, the search stops if we reach a transformation vector v with no neighbor that yields a faster kernel configuration with a confidence level of 95%.
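The comparison performed by this stop criterion can be sketched as below. The sketch uses SciPy's two-sample t-test in place of the Levine-style implementation used in the actual code, and the sample values are made up for illustration.

```python
# A sketch of the stop criterion's comparison: two populations of three
# timing samples each, compared with Student's t-test at alpha = 0.05.
# SciPy's ttest_ind stands in for the Levine-style implementation used
# in the article's actual code.
from scipy import stats

def is_significantly_faster(candidate_times, best_times, alpha=0.05):
    """True if `candidate_times` is faster than `best_times` with the
    required confidence level."""
    result = stats.ttest_ind(candidate_times, best_times)
    faster = sum(candidate_times) < sum(best_times)
    return faster and result.pvalue < alpha

best = [1.02, 1.05, 1.03]        # milliseconds, three samples (illustrative)
candidate = [0.91, 0.93, 0.92]
print(is_significantly_faster(candidate, best))   # True: keep descending
```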

3.4 Synchronous Parallelism and Speculation

The algorithm in Figure 4 keeps a current candidate transformation vector, which is initialized with the origin of the optimization space in Line 02 and is updated in Line 13 whenever a faster kernel configuration is found. To find a faster configuration, every point in the neighborhood of the current candidate is considered. The evaluation of these points happens in parallel, as Lines 09 and 10 of Figure 4 show. However, we have observed that it is often difficult to fill up every available thread with unvisited kernel configurations. Example 3.7 explains how this difficulty emerges.
Example 3.7.
Figure 7 shows how parallel evaluation of kernel configurations happens. Points in the neighborhood of the current candidate are evaluated in synchronous batches. Figure 7 represents these points as gray boxes. Usually, there will be a non-empty intersection of configurations between the neighborhood of the current candidate and the next candidate. Intersecting points will be stored in the visited set seen in Figure 4. In this example, intersecting points are marked with gray numbers in Figure 7(b) and (c). Overlaps reduce parallelism: Only five points are evaluated concurrently; thus, resources will be underutilized in processors with more than five cores.
Fig. 7.
Fig. 7. ((a)–(c)) Progress of Droplet Search without speculation. Each number shows the order in which points are evaluated. Gray boxes denote points currently being evaluated. Gray numbers show points in the neighborhood of the current candidate that were already evaluated. (d) Differential speculation: a subset of the extended neighborhood of the current candidate is evaluated to maximize thread occupancy.
Differential Speculation. We resort to speculation to maximize thread occupancy. Speculation, in the context of this article, is the evaluation of kernel configurations in the extended neighborhood of the current candidate. The extended neighborhood is formed by the neighbors of neighbors. The algorithm in Figure 4 determines these extra points—the Speculative Set—in Line 07. This set is built via differential speculation; the history of previous coordinates of the best candidates determines a speculative set as follows: Let \(v_{n-1}\) be the candidate at iteration \(n-1\) of Coordinate Descent, and let \(v_n\) be the candidate at iteration n. By Definition 3.2, \(v_{n-1}\) and \(v_n\) differ in only one dimension, e.g., \(v_{n-1} = \langle \ldots , \mathcal {P}: x_{n-1}, \ldots \rangle\) and \(v_n = \langle \ldots , \mathcal {P}: x_n, \ldots \rangle\) . We choose a new transformation vector \(v_s = \langle \ldots , \mathcal {P}: x_n + s, \ldots \rangle\) , where \(x_n + s\) is the value that follows \(x_n\) within the parameter \(\mathcal {P}\) . By the nature of Coordinate Descent, it is likely that \(v_s\) is not in the visited set. Nevertheless, if the centroid of the speculative set is in the visited set, then Line 07 of Figure 4 still ensures that multiple evaluations will not happen. Given \(v_s\) , we let the speculative set be its neighborhood. Line 07 of Figure 4 adds this set to the batch of configurations waiting to be evaluated.
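The sketch below shows one way the centroid of the speculative set could be derived from the last two candidates, assuming the dictionary-based vectors and sorted domains of the earlier sketches; it illustrates the idea, and is not the AutoTVM code of Figure 4.

```python
# A sketch of differential speculation: the last two candidates differ
# in exactly one dimension; the speculative centroid extends that move
# one more step along the same dimension. Its neighborhood is then
# appended to the batch of configurations waiting to be evaluated.
def speculative_centroid(prev, curr, domains):
    changed = [name for name in curr if curr[name] != prev[name]]
    if len(changed) != 1:
        return None                     # at the origin: nothing to extrapolate
    name = changed[0]
    domain = domains[name]
    i, k = domain.index(curr[name]), domain.index(prev[name])
    j = i + (1 if i > k else -1)        # one more step in the same direction
    if not 0 <= j < len(domain):
        return None                     # the move would leave the domain
    centroid = dict(curr)
    centroid[name] = domain[j]
    return centroid
```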
Example 3.8.
Figure 7(d) provides some idea of how speculation works. We let S be the centroid of a speculative set. This kernel configuration, S, is chosen by extending the overall direction of the Coordinate Descent path speculatively: an action represented by the longer arrow in Figure 7(d). In this example, speculation increases the number of active threads from 5 to 10.
Even though we restrict speculation to the extended neighborhood of the current candidate, this technique can still cause the line search of Coordinate Descent to leave a convex region. However, Section 4.5 shows that such events do not happen in practice. Had we adopted larger speculation steps, i.e., outside the extended neighborhood of the current candidate, the risk of leaving the convex region would be higher.

4 Evaluation

This section compares Droplet Search with similar techniques employed in the implementation of Apache TVM. To this effect, this section investigates the following research questions:
RQ1: Can we demonstrate the Droplet Expectation in different architectures, at least for transformation vectors involving a small number of dimensions?
RQ2: How effective is Droplet Search on end-to-end models compared to other search techniques on different computer architectures?
RQ3: What is the average number of samples that Droplet Search takes to converge to optimal results, compared to other search techniques?
RQ4: Does the Droplet Search converge to a global optimum when applied to an industrial-quality analytical cost model?
RQ5: How does the number of threads used in Droplet Search influence the convergence time of the algorithm and the quality of the model that it finds, with and without speculation?
RQ6: How do kernels tuned via AutoTVM and Ansor compare to hand-written implementations of kernels in TensorFlow?
RQ7: How does the behavior of Droplet Search vary, in terms of quality and speed, on each individual kernel that constitutes a machine-learning model?
RQ8: What is the impact of the confidence level on the convergence rate of Droplet Search in terms of search speed and quality of the final model?
Hardware. This section evaluates scheduling approaches on six computer architectures, which Figure 8 enumerates. This mix contains two general-purpose desktop architectures (AMD and Intel), two embedded system-on-chips (ARM), and two graphics processing units (NVIDIA).
Fig. 8.
Fig. 8. The architectures evaluated in this article.
Software. This section uses Apache TVM v0.11.1, released in March 2023. We implemented Droplet Search as part of AutoTVM. AutoTVM also provides four other search techniques: grid, random, genetic (GA), and simulated annealing (XGB). XGB is described by Chen et al. [2018]. This technique uses a cost model to guide simulated annealing. The constants that define this cost model are learned during a training phase. We also compare our implementation of Droplet Search with AutoScheduler (Ansor) [Zheng et al. 2020], also available in Apache TVM v0.11.1.
Benchmarks. This section evaluates kernel scheduling on five convolutional neural networks and on one encoder stack of transformers (BERT). Figure 9 enumerates the models. The schedulers in AutoTVM, including Droplet Search, use the same optimization parameters. We have not chosen these parameters: They are pre-defined (per model) in the distribution of AutoTVM. Optimization parameters differ depending on the computer architecture. In the CPUs, these parameters refer to two optimizations: tiling and unrolling. In the GPUs, they concern two more: thread blocking (the ability to partition CUDA threads into blocks) and shared memory tiling (the ability to bring data from global to shared memory). All the parameters are divisors of boundaries of the iteration space. For instance, the size of a tile window must be a divisor of the size of the iteration space along the tiled dimension. This fact removes discontinuities from the search space (code that the compiler inserts to handle boundary conditions), as Section 4.4 will explain.
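The restriction to divisors can be sketched as follows; the helper below is an illustration of the idea, not TVM's API.

```python
# A sketch (not TVM's API) of tile-size candidates restricted to perfect
# divisors of a loop extent. This restriction avoids the epilogue code,
# and the resulting discontinuities, discussed in Section 4.4.
def divisor_tiles(extent):
    """All tile sizes that evenly divide the loop extent."""
    return [d for d in range(1, extent + 1) if extent % d == 0]

print(divisor_tiles(224))   # e.g., a 224-wide convolution dimension:
# [1, 2, 4, 7, 8, 14, 16, 28, 32, 56, 112, 224]
```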
Fig. 9.
Fig. 9. The machine-learning models evaluated in this article. “S” is the number of possible configurations formed by a given number “P” of transformation parameters. The largest neighborhood explored by Droplet Search at any time contains \(2\times \mbox{P} + 1\) points. “K” is the number of kernels that form each model. The implementation of the models varies according to the architecture. TVM’s implementation of MXNet does not run on the ARM boards.
Ansor uses more optimizations than AutoTVM. Whereas AutoTVM is restricted to a single transformation vector format (in TVM’s parlance, a kernel template), Ansor creates several of them. The extra templates effectively give Ansor access to inter-kernel optimizations, such as loop fusion and fission: It can merge operators (i.e., kernels) in the computational graph, for instance. Figure 9 shows the number of parameters in the largest template that Ansor explores for each network. The optimizers used in AutoTVM—Droplet Search included—are restricted to intra-kernel optimizations. Nevertheless, we hope to demonstrate that, even with access to a smaller pool of optimizations, Droplet Search can be competitive with Ansor, which resorts to more extensive code transformations.

4.1 RQ1: The Droplet Expectation

Definition 3.6 specifies a behavior that is likely to characterize typical implementations of linear kernels. This expectation is not a guarantee; thus, while we anticipate finding a path from origin to optimum along which performance improves gradually, such a path might not exist. In this case, Droplet Search will be stuck on a local minimum. Nevertheless, this section provides some empirical evidence that the Droplet Expectation holds. Section 4.2 provides further evidence: The best kernel configurations found by Coordinate Descent are similar to the best configurations found via more extensive techniques, even when speculation is enabled.
Methodology. We have analyzed the behavior of two kernels: matrix multiplication and convolution on six architectures. In this experiment, we use two-dimensional spaces for the sake of visualization. On CPUs (ARM or AMD), the transformation space is formed by the tiling of the two innermost loops of each kernel. Tile sizes, in both dimensions, are \(0, 8,16,24, \ldots ,128\) ; hence, we have a \(17 \times 17\) transformation space. On GPUs, the transformation space is formed by the number of threads in two-dimensional thread blocks. The possible number of threads are \(1, 2, 4, \ldots , 32\) , also forming a \(17 \times 17\) transformation space. Convolution uses a \(1{,}024 \times 1{,}024\) matrix with a \(3 \times 3\) filter. Multiplication uses the following matrix sizes: \(1{,}000 \times 700\) , \(700 \times 800\) , and \(1{,}000 \times 800\) .
Fig. 10.
Fig. 10. Visual representation of nine different \(17 \times 17\) transformation spaces, showing the path traversed by Coordinate Descent from origin to global optimum. Matrix multiplication is mm2d, and convolution is conv3. Performance improves from green to red. Black cells denote invalid configurations. We show, on top of each grid, the nesting order of loops, where the canonical configuration starts with the nested sequence ijk.
Discussion. The Droplet Expectation holds in every one of these \(2 \times 6\) scenarios. Figure 10 shows representations for nine of these spaces, highlighting the path taken by Coordinate Descent. Some configurations involving thread blocking on the GPU are not valid; thus, they are not evaluated. In every scenario, the optimal kernel configuration is close to the origin, as the paths in Figure 10 emphasize. The Droplet Expectation continues to hold if we increase the number of dimensions in the transformation space or the number of kernels in the model. Section 4.2 discusses this experiment.

4.2 RQ2: End-to-end Effectiveness

The goal of any kernel scheduling technique is to find efficient implementations of computational graphs involving kernels. This section investigates how Droplet Search fares in such a task, compared with other scheduling techniques, when optimizing well-known machine learning models.
Methodology. We evaluate six end-to-end models on six architectures, with a hard limit of 10,000 evaluations per search technique. This arbitrary limit prevents experiments from running for too long. The Droplet Search will stop after evaluating 10,000 kernel configurations; however, it tends to stabilize earlier, as discussed in Section 4.8. The other search techniques do not have a notion of premature convergence; hence, they evaluate 10,000 configurations. Notice that Figure 9 shows that every transformation space contains more than 10,000 configurations. All the search algorithms evaluate kernel configurations in parallel. The parallelism in Droplet Search follows the ideas from Section 3.4. We use a confidence interval of 95% when comparing a candidate kernel configuration with kernels within its neighborhood. Section 4.3 discusses the impact of this last choice.
Fig. 11.
Fig. 11. Comparison of kernel scheduling techniques. “Model (ms)” is the running time of the best schedule for a given model. “Search (min)” is the time to find that configuration. Light-gray boxes denote statistically similar results (with a confidence level of 95%). Black boxes denote statistically significant best kernel times. Gray boxes with white fonts denote statistically significant best search times. Double borders denote the best schedule produced by an algorithm from AutoTVM only (i.e., Ansor is not considered).
Discussion. Figure 11 compares the effectiveness of different search techniques. Considering only the search techniques in AutoTVM, Droplet Search generally yields the best kernels or ties for the best (usually with XGB). Its search, however, is faster. On the AMD CPU, for instance, Droplet Search optimizes VGG-16 in 10% of the time the other scheduling approaches require. Ansor tends to produce faster kernels than the techniques implemented in AutoTVM, including Droplet Search. Ansor explores more optimization parameters: It has access to a large number of kernel templates, whereas AutoTVM uses only one. Nevertheless, due to excessive memory consumption, Ansor (and also AutoTVM’s XGB) could not be used to schedule our largest models on the ARM boards. We have observed that Droplet Search does not perform well on the GPUs. In this case, the size of the search space (from 200M to 2.7T configurations, as seen in Figure 9) forces Coordinate Descent to stop at the iteration limit in every model except ResNet-18. Incidentally, on ResNet-18 Droplet Search produces faster kernels than Ansor on every GPU.

4.3 RQ3: Stop Criteria

As Section 3.3 explains, our current implementation of Droplet Search stops once it reaches a candidate kernel configuration faster than all its neighbors. Statistical significance of runtime differences is determined via Student’s t-test applied over two populations consisting of three samples each, with a confidence level of 95%. Yet, measurements might fluctuate, and premature termination is possible. If we tighten the confidence level, then termination might happen too early. If we loosen it, then convergence can take too long, and Coordinate Descent might visit configurations that are not statistically significantly faster. This section investigates how Droplet Search fares once we vary the confidence level for comparing kernel configurations.
Methodology. We evaluate the five deep-learning models listed in Figure 9 on the Intel i7 using five different levels of confidence: 99%, 95%, 90%, 75%, and no test. In the latter case, we use the absolute arithmetic average of three samples to determine which kernel is faster.
Fig. 12.
Fig. 12. Effect of confidence level (the stop criterion of Section 3.3) on the quality of the best kernel configuration and on the running time of Droplet Search.
Discussion. We observed almost no variation in the quality of the best kernel configuration depending on the confidence level. This result seems to indicate, at least for the five models running on the Intel i7 CPU, that there exists a number of “acceptable best” kernels with very similar dynamic behavior. However, the search time increases—albeit slightly—once we move from high confidence levels toward low confidence levels. This growth in search time happens because more kernel configurations tend to be visited by the search procedure.

4.4 RQ4: Analytical Models

Analytical cost models are systems of equations that predict the cost of a program (CPU cycles, I/O operations, cache misses, etc.) given a model of the hardware. Recently, different research groups have designed analytical cost models to estimate the performance of machine-learning kernels [Olivry et al. 2021; 2020, Sumitani et al. 2023, Tollenaere et al. 2023, Zhang et al. 2021]. This section investigates if the droplet expectation also holds in some of these models.
Methodology. This section evaluates cost models taken from three different sources. The first models were proposed by Olivry et al. [2021]. They estimate a lower bound for the amount of data movement between slow and fast memories. The second model is part of the Xtensa Neural Network Compiler (XNNC), from Cadence Tensilica, and was made available to us through the Cadence Academic Network.5 This model estimates the number of execution cycles that a program compiled by XNNC takes to execute on Cadence DSPs. Finally, we analyze the eight analytical models listed in Table II of Renganarayana and Rajopadhye [2008].

4.4.1 Olivry et al.’s Cost Models.

We evaluate Olivry et al.’s model on the tiled version of matrix–matrix multiplication from the authors’ original work (Listing 1 [Olivry et al. 2021]). This program is the default example in Olivry et al.’s online tool.6 We evaluate it in a system with two caches, of 64 and 256 KB—the sizes of the first two cache levels of the Intel i7.
Discussion. Figure 13 shows hypersurfaces produced by the cost models generated via Olivry et al.’s online tool. Each figure relates the size of a bidimensional tiling window to the I/O cost in terms of memory transfers. The lower the cost, the faster the program is expected to run. Figure 13(a) assumes one level of cache (with 64 KB). The other two figures assume two levels (64 and 256 KB). Figure 13(b) and (c) are similar to the surfaces seen in Figure 5, which explore the same program, albeit on an actual machine. In the three figures, the droplet expectation (Definition 3.6) holds. This result is not a coincidence: Olivry et al.’s cost models involve only positive quantities (domains and coefficients) and, thus, form convex surfaces.
Fig. 13.
Fig. 13. Amount of memory transfers produced by Olivry et al. [2021]’s cost model applied onto Listing 1 in Olivry et al.’s work: a tiled version of the kernel in Figure 1(b) in this article. We vary tiling dimensions Ti_0, Ti_1, and Tk_0. The input matrices have sides Ni = Nj = Nk = 10K.

4.4.2 Tensilica’s Cost Models.

We evaluate the Tensilica model on an implementation of the tiled ReLU kernel on two digital signal processors, called P1 and P6. We chose these two processors because they are used as tests in the Tensilica tool. The DSPs do not have a cache; however, they contain local memory and system memory. Hence, the compiler must implement direct memory access (DMA) transfers from system memory to local memory, and tiling reduces the number of DMA operations.
Discussion. In contrast to Olivry et al.’s cost model, the equations produced by XNNC take into consideration the fact that the tiling window might not be a perfect divisor of the loop’s iteration space. If tiling does not perfectly divide the iteration space, then the XNNC compiler generates epilogue code to fetch data outside the tiled loop. Consequently, the performance model contains conditionals. For instance, Figure 14(a) was produced by a (simplified) equation like \(\mathtt {cost} = C_1 + (\mathtt {if} \ W\%63 \ \mathtt {then} \ C_2 \ \mathtt {else} \ 0) + C_3 \times (H/64)\) . The coefficients \(C_i\) represent costs of particular instructions; W and H are tile sizes. Due to these conditionals, the cost models are represented by discontinuous functions. In this case, the droplet expectation does not hold. However, if we restrict valid neighborhoods to only perfect divisors of the iteration space, then the Droplet Expectation holds. For instance, starting with tiling windows of size 16 or greater, coordinate descent achieves the optimal configuration in any of the surfaces seen in Figure 14 after three iterations. Notice that this restriction is unnecessary if some loss is acceptable. When applied to large deep-learning networks—under the same XNNC analytical model—Droplet Search stays very close to the global optimum, while sampling less than 1% of the space covered by an exhaustive grid search.
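To make the discontinuity visible, the sketch below evaluates the simplified equation quoted above with made-up coefficients; it is not the actual XNNC model.

```python
# The simplified XNNC-style cost quoted above, with made-up coefficients,
# showing how the divisibility conditional introduces a discontinuity:
# cost = C1 + (C2 if W % 63 != 0 else 0) + C3 * (H / 64).
def xnnc_cost(w_tile, h_tile, c1=100.0, c2=500.0, c3=10.0):
    epilogue = c2 if w_tile % 63 else 0.0   # extra code for imperfect tiles
    return c1 + epilogue + c3 * (h_tile / 64)

print(xnnc_cost(63, 64))   # a perfect divisor: no epilogue penalty
print(xnnc_cost(64, 64))   # off by one: the cost jumps by C2
```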
Fig. 14.
Fig. 14. Part of the tiling search space generated for a ReLU layer by the XNNC’s analytical cost model for the P1 digital signal processor. The higher the cost (yellower), the slower the kernel is expected to behave, when deployed onto the actual hardware. To emphasize the non-convex regions, we show only the first 16 sizes of possible tiling windows.

4.4.3 Renganarayana and Rajopadhye’s Cost Models.

Renganarayana and Rajopadhye [2008] show that models used to solve the “Tile Size Selection problem” are represented by equations whose coefficients and domains are all positive quantities. To support their observation, they list equations taken from eight different analytical models from previous work. These equations all represent instances of the bidimensional tile-size selection problem. They use variables that range on the following quantities:
C: the size of the cache in the target computer architecture;
L: the length of the cache line;
h: the height of the rectangular tiling window;
w: the width of the rectangular tiling window; and
n: the side of an \(\mathbf {n} \times \mathbf {n}\) array.
In this section, we fix C, L, and n, and plot the hypersurface formed by h and w, within a contiguous range of values. This approach simulates Apache TVM’s grid search algorithm.
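The sketch below shows the sweep in code. The function model_cost stands in for any of the equations in Table II (not reproduced here); toy_model is a made-up placeholder with the Positivity property, included only so the sketch runs.

```python
# A sketch of the grid sweep used to plot these hypersurfaces: fix C, L,
# and n, then evaluate a cost model over a contiguous range of tile
# heights and widths.
def sweep(model_cost, C, L, n, tile_range=range(1, 101)):
    return [[model_cost(C, L, n, h, w) for w in tile_range]
            for h in tile_range]

def toy_model(C, L, n, h, w):
    # A placeholder with the Positivity property (not one of the eight
    # models of Table II): cache pressure grows with the tile, while
    # reload traffic shrinks with it.
    return (h * w) / C + (n * n) / (h * w * L)

surface = sweep(toy_model, C=32 * 1024, L=64, n=1024)
print(min(min(row) for row in surface))
```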
Discussion. Figure 15 shows hypersurfaces for the different equations analyzed by Renganarayana and Rajopadhye. To ease visualization, we remove the h and w axes (both ranging over the interval \([1, \ldots , 100]\) ). The Droplet Expectation holds in all these analytical models. In fact, all of these equations use exclusively positive coefficients, hence yielding convex surfaces.
Fig. 15.
Fig. 15. Hypersurfaces formed by the models in Table II of Renganarayana and Rajopadhye [2008].

4.5 RQ5: Parallelism and Speculation

As explained in Section 3.4, the implementation of Droplet Search evaluated in this article uses synchronous parallelism and speculation to speed convergence up. This section evaluates the effects of these techniques on search time and on the quality of kernel configurations.
Methodology. We evaluate Droplet Search on the five different end-to-end models on the AMD 3700X CPU. This CPU runs 16 threads (two threads per core, with eight cores). This section experiments with three versions of Droplet Search: its full implementation, an implementation that features parallelism but no speculation, and an implementation that runs on a single thread. We report p-values comparing three executions of the fastest models found with each approach.
Fig. 16.
Fig. 16. (Left) Number of iterations until convergence of variations of Droplet Search. (Middle) Running time of different implementations of Droplet Search. (Right) The p-values produced by a t-test on populations of three executions of the best model found by each technique. The closer the p-value is to 1.0, the more statistically similar the two populations are.
Discussion. Figure 16 compares the different implementations of Droplet Search. Its single-threaded implementation takes 664 s to converge (sum of tuning time over five networks). The parallel version, with 16 threads, converges in 358 s. With speculation plus parallelism, this time goes down to 291 s. Parallelism is a known limitation of Coordinate Descent. Our implementation suffers from the shortcomings mentioned by Zheng et al. [2000]. The synchronous nature of coordinate descent limits concurrency: The best candidate is chosen after every point of a neighborhood is evaluated. Thus, progress only happens once the slowest point runs. Wang et al. [2016] have shown that it is possible to improve parallelism if more candidate points co-exist. We believe that early preemption of slow points could also speed our implementation up: Once the best candidate is found in a neighborhood, the other threads can be aborted. We leave such approaches—multiple candidates and early preemption—open for future work. Speculation improves the running time of our implementation of Droplet Search by a small margin. The parallel version of Droplet Search, with speculation, is 27% faster than the non-speculative implementation (geomean over speedups). This gain comes mostly from faster convergence.
Figure 16 (right) shows that the extended neighborhood explored via speculation has no effect on the speed of the kernel configurations found via Droplet Search. The search does not always find the same final configuration for every layer of every model; however, this phenomenon is due just to statistical variations in the running time of similar kernels. As the figure shows, the p-values reported by a t-test on the speed of the different kernels are well above 0.05. Thus, the extended neighborhood is not changing the behavior of our implementation of Droplet Search.

4.6 RQ6: Comparison with TensorFlow

The goal of this section is to bring some perspective about our results to readers that are not familiar with the Apache TVM ecosystem. To this end, we shall present a comparison between three different approaches to develop end-to-end models: AutoTVM, Ansor, and TensorFlow [Abadi et al. 2016]. The latter is a Python-based library to write machine learning models. In contrast to AutoTVM or Ansor, TensorFlow does not do, by itself, any form of scheduling: The programmer must feed it with an optimized implementation of a machine learning model.
Methodology. This section compares the different kernel implementation approaches using a benchmark collection formed by six kernels: matrix multiplication, two-dimensional (2D) convolution, depthwise separable convolution, pooling, matrix reduction, and 2D ReLU. These kernels have been taken from the artifact made publicly available by Zhu et al. [2022] and are meant to run on graphics processing units. We evaluate them on our RTX 3080 GPU and on an A6000.7 Incidentally, we have not been able to apply Zhu et al.’s tool to these very kernels.8
Discussion. Figure 17 shows the comparison between different kernel implementation systems. Notice that Figure 17 compares different kernel implementations. Ansor and AutoTVM (Droplet Search and XGB) receive, as input, the same code: a kernel implemented in Python with libraries from Apache TVM. TensorFlow receives a different implementation: kernels also written in Python, but with libraries from the TensorFlow package. As an example, the API to invoke the ReLU kernel is tf.nn.relu(a_tf) in TensorFlow and topi.nn.relu(a_tvm) in Apache TVM. In short, Figure 17 compares two different Python libraries.
Fig. 17.
Fig. 17. Comparison between Apache TVM and TensorFlow on the six kernels made available by Zhu et al. [2022], on two graphics processing units. This figure uses the same notation as Figures 11 and 18: Dark boxes indicate the fastest models, and gray boxes indicate the fastest search times. Light gray boxes indicate results that are statistically similar. TensorFlow does not implement search.
In every experiment, Apache TVM has produced kernels that are faster than (or as fast as) those of TensorFlow, be it through AutoTVM or Ansor. However, without scheduling, TensorFlow outperforms Apache TVM in two kernels: convolution and depthwise convolution. We have not observed a statistical difference between implementations of the reduction and the ReLU kernels, regardless of the library or the scheduling approach adopted to optimize them. In this regard, we observe that neither Ansor nor AutoTVM implements search for pooling, reduction, and ReLU. The implementation of these kernels, as provided by Zhu et al., does not come with a template of optimization parameters. The other kernels, in contrast, come with templates that enable thread blocking and tiling with shared memory. Loop unrolling is not enabled by these parameters. Nevertheless, Figure 17 confirms some of the results earlier observed by Zhu et al.: Kernels produced by Apache TVM tend to outperform similar kernels produced via TensorFlow. However, contrary to Zhu et al., we have observed smaller differences.

4.7 RQ7: Intra-Kernel Behavior

The models explored in Section 4.2 consist of multiple kernels: Each layer is implemented as an independent kernel. Droplet Search and all the other search techniques available in AutoTVM are intra-kernel; thus, kernels are optimized independently of each other. This section analyzes the effects of Droplet Search on individual kernels and compares these effects with the results obtained by other scheduling techniques. By showing that Droplet Search finds configurations similar to those found by exhaustive techniques, we provide further evidence that the droplet expectation commonly holds.
Methodology. We analyze each kernel of ResNet-18 and VGG-16 separately, reporting search time and kernel running time. Each kernel is extracted from its encompassing model via TVM's code generator. ResNet-18 and VGG-16 are our smallest networks in number of kernels. We restrict this study to two models because the individual analysis of each kernel is time-consuming (human time, not machine time). Nevertheless, we believe that these results could be extrapolated to the other models, which have similar implementations.
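Readers who wish to reproduce this kind of per-layer analysis can extract independent tuning tasks from a model, as in the sketch below; the use of TVM's built-in ResNet-18 testing workload, the XGB tuner, the trial budget, and the log-file names are assumptions of this example, not the exact setup of the article.

```python
from tvm import autotvm
from tvm.relay.testing import resnet

# A stand-in model: TVM's testing ResNet-18 (the article extracts kernels
# from its own models via TVM's code generator).
mod, params = resnet.get_workload(num_layers=18, batch_size=1)

# Each tunable layer becomes an independent AutoTVM task.
tasks = autotvm.task.extract_from_program(mod["main"], target="llvm", params=params)

measure = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=3),  # three samples per configuration
)

for i, task in enumerate(tasks):
    tuner = autotvm.tuner.XGBTuner(task)   # any AutoTVM tuner fits here
    tuner.tune(
        n_trial=1000,
        measure_option=measure,
        callbacks=[autotvm.callback.log_to_file(f"layer_{i}.log")],
    )
```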
Fig. 18. The search techniques of AutoTVM applied on individual layers of deep-learning models on the AMD R7-3700X. Evaluations average three samples. Light gray boxes mark running times that are statistically similar with confidence level of 95%. Black boxes denote statistically significant running times. Gray boxes mark statistically significant search times. The sum of search times approximates the results in Figure 11. The sum of kernel times, i.e., “Time (ms),” is strictly less than the running times reported in Figure 11.
Discussion. Figure 18 shows how the search techniques fare on each layer of ResNet-18 and VGG-16. We do not show results for Ansor because, as Figure 9 shows, Ansor’s implementation recognizes more layers in each model. Droplet Search never yields worse configurations than the other search techniques. Furthermore, its search is faster, converging with fewer samples. Column “iter” in Figure 18 reports the number of kernel configurations evaluated by each search technique. In contrast to Droplet Search, the other approaches used in AutoTVM have no notion of convergence: The search stops once a predetermined number of kernel configurations has been visited. However, sampling is not equally divided among the kernels: AutoTVM draws more samples for kernels that run for more time. As an example, the different search procedures of AutoTVM sample Layer Five of ResNet-18 1,024 times, as Figure 18 shows, whereas Droplet Search stops after 120 iterations. The number of samples that Droplet Search evaluates depends more on the dimensions of the layer, such as the number of channels and the width and height of the filter. The four largest layers in Figure 18 are, in this order, the \(6{\rm th}\) , the \(8{\rm th}\) , the \(9{\rm th}\) , and the \(11{\rm th}\) . Incidentally, these layers account for the largest numbers of samples evaluated by Droplet Search.

4.8 RQ8: Convergence Rate

The convergence rate of a search mechanism used to solve the kernel scheduling problem measures how fast that technique closes in on its final solution. As mentioned in Section 4.7, the different techniques that AutoTVM uses to solve kernel scheduling iterate until a fixed number of configurations has been evaluated. We have observed that it is often possible to stop iterating earlier, once a sufficiently good configuration is reached. That is the approach adopted in Droplet Search, as Section 3.3 explains. In what follows, we investigate how the quality of the final solution to scheduling improves as the number of evaluations grows.
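The sketch below illustrates this idea with a coordinate-descent-style local search that stops as soon as no neighbor improves the best running time; the ±1 neighborhood, the trial budget, and the function names are simplifying assumptions of this example, not the exact algorithm of Section 3.3.

```python
import itertools

def droplet_like_search(evaluate, start, bounds, max_trials=10_000):
    """Minimal local search with a convergence-based stop.

    `evaluate` maps a configuration (tuple of parameter indices) to a running
    time; `bounds` gives the number of choices per parameter.
    """
    best, best_time, trials = tuple(start), evaluate(tuple(start)), 1
    while trials < max_trials:
        improved = False
        # Visit the immediate neighborhood: one parameter changed by +-1.
        for axis, delta in itertools.product(range(len(best)), (-1, 1)):
            cand = list(best)
            cand[axis] += delta
            if not (0 <= cand[axis] < bounds[axis]):
                continue
            cand = tuple(cand)
            t = evaluate(cand)
            trials += 1
            if t < best_time:
                best, best_time, improved = cand, t, True
        if not improved:          # no neighbor improves: converged, stop early
            break
    return best, best_time, trials
```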
Methodology. We set the maximum number of iterations of AutoTVM’s grid, random, genetic, and XGB search to 10,000. This number, 10K, includes the evaluations of kernel configurations in the neighborhoods or speculative sets that Droplet Search uses. Ansor uses the same limit of evaluations. We then inspect the speed of the best kernel configuration that each of these search techniques finds for ResNet-18 throughout the search. We show results for ResNet-18 only; however, we have evaluated the convergence rate for the other models, and the results tend to be similar.
Fig. 19. Quality of the solution for kernel scheduling versus the number of evaluations (trials) used to find that solution for ResNet-18. The blue star marks the convergence of the Droplet Search.
Discussion. Figure 19 shows the results of this experiment. Droplet Search usually converges to its best solution well before 10,000 evaluations. The fastest convergence was observed on the Intel CPU: Coordinate Descent stabilized after 1,278 configurations had been evaluated. Notice that convergence happened before 2K evaluations on every CPU. The slowest convergence happened on the GTX GPU, with 7,125 evaluations; similarly, on the RTX, 6,502 configurations were evaluated. Similar results were also observed on the other four models: Droplet Search stabilizes before 10K iterations, typically before 3K iterations on the CPUs and before 8K iterations on the GPUs.

5 Related Work

This article aims to find the best concrete implementation of a given program. We recognize two main approaches to this theme, which we shall call autotuning (e.g., compiler autotuning) and scheduling (e.g., program autotuning, following the taxonomy of Tollenaere et al. [2023]). The latter is the problem from Definition 2.7. These problems differ in two essential ways:
Training: In autotuning, the compiler is trained offline on many programs—its training set—before being applied to an unknown program. Thus, the compiler uses information acquired from general programs before optimizing a specific program. In scheduling, there is no pre-training phase: The compiler does not try to generalize the behavior of a universe of programs to predict the behavior of an individual program.
Sampling: In autotuning, the compiler typically evaluates the target program once (although there are exceptions, like in the work of Cavazos et al. [2007]), using, as a guide, the behavior learned from observations made on the training set. In scheduling, the compiler is allowed to run the target program multiple times. Information acquired from these executions will guide the search for good optimizations.
Many of the current techniques employed in autotuning originate in the work of Cavazos and his collaborators [Agakov et al. 2006; Ashouri et al. 2018, 2016; Cavazos et al. 2006; Cavazos and O’Boyle 2006; Moss et al. 1997; Simon et al. 2013; Thomson et al. 2010]. The growing availability of predictive models and of benchmarks to train them has made works on autotuning common in the recent literature [Brauckmann et al. 2020; Cummins et al. 2021a, 2021b; Da Silva et al. 2021; Silva et al. 2021]. Figure 20 compares six autotuning-based techniques with six scheduling-based approaches. Whereas autotuning typically concerns the optimization of general programs, scheduling is mainly seen in the optimization of deep-learning models composed of kernels. Autotuning has been used, for instance, to find good sequences of optimizations for clang [da Silva et al. 2021; Silva et al. 2021] or to fine-tune the HotSpot JIT compiler [Cavazos and O’Boyle 2006].
Fig. 20. Qualitative comparison between different previous works. See Section 5 for the meaning of columns. We classify the first six works as autotuning and the last six as scheduling approaches.
The Transformation Space. Scheduling techniques usually fix the sequence of optimizations that forms the search space and vary their parameters [Chen et al. 2018; Tollenaere et al. 2023; Zhang et al. 2021]. However, there are scheduling approaches that accept different transformation vectors [Phothilimthana et al. 2021; Meng et al. 2022; Zheng et al. 2020; Essadki et al. 2023]. Within the TVM community, these different transformation vectors are called templates. For instance, a common technique, adopted in TVM’s Ansor, is to assume that each interchange of loops forms a different template. Figure 20 distinguishes fixed vectors as “params” and templated vectors as “param list.” Autotuning approaches usually fix the parameters of each optimization; however, they have more freedom to compose sequences of optimizations. For instance, Cavazos et al. [2007] form sequences of 500 optimizations drawn from a universe of 121 possible compilation flags. This approach is also adopted by Silva et al. [2021]: They produce lists of up to 100 elements drawn from approximately 80 compilation flags. Figure 20 uses the notation “list” to denote this way of building the search space. In Figure 20, we denote fixed-length vectors as “vec \([n]\) .”
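To make the notion of a fixed transformation vector concrete, the sketch below declares a matrix-multiplication template in the AutoTVM style, in which tiling and unrolling are fixed transformations and only their parameters vary; the template name, shapes, and knob values are assumptions of this example rather than the templates used in the article.

```python
import tvm
from tvm import te, autotvm

@autotvm.template("example/matmul_fixed_vector")  # hypothetical template name
def matmul(N, L, M, dtype="float32"):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    y, x = s[C].op.axis
    cfg = autotvm.get_config()
    cfg.define_split("tile_y", y, num_outputs=2)   # tiling window (rows)
    cfg.define_split("tile_x", x, num_outputs=2)   # tiling window (columns)
    cfg.define_knob("auto_unroll", [0, 8, 16])     # unrolling factor

    # The sequence of transformations is fixed; only their parameters vary.
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, yi, xi)
    s[C].pragma(yo, "auto_unroll_max_step", cfg["auto_unroll"].val)
    return s, [A, B, C]
```

Every point in the space defined by these knobs corresponds to one kernel schedule; the Cartesian product of the parameters is exactly the coordinate space that a tuner explores.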
The Search Guide. Techniques used in autotuning or scheduling differ in how the search for good transformation sequences is performed. In this article, we use Coordinate Descent. Our approach does not depend on static characteristics of the program, only on its dynamic behavior. Several search techniques use program features (static characteristics) to steer the search. These techniques are usually data-agnostic. An exception is the work of Da Silva et al. [2021], who use the runtime values of inputs to choose program configurations. Da Silva et al. capitalize on the convexity of the search space; however, in their case, tuning is guided by linear regression, not by Coordinate Descent. Recent scheduling techniques have used analytical models to prune the search space [Kaufman et al. 2021; Zhang et al. 2021; Tollenaere et al. 2023; Mogers et al. 2022]. Mogers et al. [2022] have shown, in the context of the Lift compiler [Steuwer et al. 2016], that pruning can be very effective, as “only 1 out of 49,000 candidates [generated by random search to optimize convolution] satisfies the constraints [hence is valid].” Pruning is orthogonal to the search techniques that we evaluate in this article. For instance, an interesting continuation of the ideas in this article would be to use Tollenaere et al.’s cost model to remove from the active neighborhood kernel configurations that are unlikely to improve on the best candidate seen by Coordinate Descent.
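Under the assumption that such a cost model is available as a function from configurations to predicted running times, that continuation could take the shape of the hypothetical filter below; the slack factor and all names are illustrative.

```python
from typing import Callable, Iterable, List, Tuple

Config = Tuple[int, ...]  # one point in the space of transformation parameters

def prune_neighborhood(neighbors: Iterable[Config],
                       predicted_time: Callable[[Config], float],
                       best_measured_time: float,
                       slack: float = 1.10) -> List[Config]:
    """Discard, before any measurement, the neighbors whose predicted running
    time is not within `slack` of the best time measured so far."""
    return [c for c in neighbors
            if predicted_time(c) <= slack * best_measured_time]
```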

6 Conclusion

This article has introduced a new kernel scheduling technique, Droplet Search, and has demonstrated its effectiveness by optimizing the code of six different deep learning models on six different hardware architectures. Droplet Search relies on the following observation: The optimization space formed by code transformation parameters usually determines a convex region that includes the origin of this space (no optimization) and the best kernel configuration that this space contains. Experiments show that Droplet Search tends to find kernels as efficient as those found by the other search techniques available in AutoTVM; however, it does so faster. Droplet Search also compares well with TVM’s Ansor, despite using a much smaller pool of optimizations.
Droplet Search is currently available in Apache TVM. This implementation still offers room for improvement. In particular, its convergence rate on large search spaces can be slow. Furthermore, the implementation could benefit more from parallelism, because it keeps only one best candidate at any time. We conjecture that Coordinate Descent could be modified to use multiple line searches, hence offering more opportunities for parallelization. Finally, the current implementation of Droplet Search explores the parameters of only four TVM optimizations: tiling, unrolling, thread blocking, and shared-memory tiling. Adding more optimizations to this list would be a welcome improvement. All these ideas are directions that we hope to see explored in the future.

Acknowledgment

The authors extend their appreciation to the ACM Transactions on Architecture and Code Optimization reviewers for their valuable suggestions, which significantly contributed to the enhancement of this work.

Footnotes

2. Notions of iteration and data space are standard in the compiler literature. Such concepts appeared independently in the works of Feautrier [1991] and Wolf and Lam [1991], eventually leading to the concept known today as the Polyhedral Model.
3. It is not clear who invented Coordinate Descent. Descriptions of the algorithm can be found in classic textbooks [Zangwill 1969]. For a comprehensive overview, we recommend the work of Wright [2015].
4. Previous work has observed that speedups due to compiler optimizations may not follow a normal distribution [Álvares et al. 2021]. Student’s t-test is parametric and, hence, not recommended for non-Gaussian distributions. However, non-parametric tests also have shortcomings; in particular, they tend to require more samples. As an example, the minimum recommended number of samples for Wilcoxon’s non-parametric test [Wilcoxon 1992] would be five, under a confidence level of 95%.
6.
7. The A6000 GPU is only used in this section. Access to this hardware was kindly provided by the Discovery Lab (https://discovery.ic.unicamp.br/).
8. Roller, the tool implemented by Zhu et al., relies on RT Cores to speed up the execution of kernels. This tool uses functions that are exclusive to CUDA v10.2 and TensorFlow v1.15.2. However, the RTX 3080, the only GPU that we have with RT Cores, is only compatible with CUDA v12.0 and TensorFlow v2.13.0. We could not downgrade the version of CUDA. Furthermore, a direct change of APIs to upgrade the versions of CUDA and TensorFlow in Roller was not enough to let us reuse the tool: The updated version compiles successfully; however, it does not produce kernels. We faced similar issues when trying to reuse another artifact that targets RT Cores: Heron [Bi et al. 2023]. Heron was implemented with the CUDA v11.0 API. We have also updated it to use CUDA v12.0; the updated version compiles and produces kernels, but these kernels crash during execution.

References

[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In OSDI. USENIX Association, New York, NY, 265–283.
[2]
F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O’Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. 2006. Using machine learning to focus iterative optimization. In CGO. IEEE Computer Society, Washington, DC, 295–305. DOI:
[3]
Andrei Rimsa Álvares, José Nelson Amaral, and Fernando Magno Quintão Pereira. 2021. Instruction visibility in SPEC CPU2017. J. Comput. Lang. 66 (2021), 101062. DOI:
[4]
Amir H. Ashouri, William Killian, John Cavazos, Gianluca Palermo, and Cristina Silvano. 2018. A survey on compiler autotuning using machine learning. Comput. Surv. 51, 5 (2018), 96:1–96:42. DOI:
[5]
Amir Hossein Ashouri, Giovanni Mariani, Gianluca Palermo, Eunjung Park, John Cavazos, and Cristina Silvano. 2016. COBAYN: Compiler autotuning framework using bayesian networks. ACM Trans. Arch. Code Optim. 13, 2 (2016), 21:1–21:25. DOI:
[6]
D. Bertsekas. 2009. Convex Optimization Theory. Athena Scientific, Nashua, NH. https://books.google.com.br/books?id=0H1iQwAACAAJ
[7]
Jun Bi, Qi Guo, Xiaqing Li, Yongwei Zhao, Yuanbo Wen, Yuxuan Guo, Enshuai Zhou, Xing Hu, Zidong Du, Ling Li, Huaping Chen, and Tianshi Chen. 2023. Heron: Automatically constrained high-performance library generation for deep learning accelerators. In ASPLOS. Association for Computing Machinery, New York, NY, 314–328. DOI:
[8]
Alexander Brauckmann, Andrés Goens, Sebastian Ertel, and Jeronimo Castrillon. 2020. Compiler-based graph representations for deep learning models of code. In CC. Association for Computing Machinery, New York, NY, 201–211. DOI:
[9]
John Cavazos, Grigori Fursin, Felix Agakov, Edwin Bonilla, Michael F. P. O’Boyle, and Olivier Temam. 2007. Rapidly selecting good compiler optimizations using performance counters. In CGO. IEEE Computer Society, Los Alamitos, CA, 185–197. DOI:
[10]
John Cavazos, J. Eliot B. Moss, and Michael F. P. O’Boyle. 2006. Hybrid optimizations: Which optimization algorithm to use? In Proceedings of the 15th International Conference on Compiler Construction (CC’06). Springer-Verlag, Berlin, 124–138. DOI:
[11]
John Cavazos and Michael F. P. O’Boyle. 2006. Method-specific dynamic compilation using logistic regression. In OOPSLA. Association for Computing Machinery, New York, NY, 229–240. DOI:
[12]
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274. Retrieved from http://arxiv.org/abs/1512.01274
[13]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI (OSDI’18). USENIX Association, Berkeley, CA, 579–594.
[14]
Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Michael F. P. O’Boyle, and Hugh Leather. 2021a. ProGraML: A graph-based program representation for data flow analysis and compiler optimizations. In ICML, Vol. 139. PMLR, Baltimore, MD, 2244–2253.
[15]
Chris Cummins, Bram Wasti, Jiadong Guo, Brandon Cui, Jason Ansel, Sahir Gomez, Somya Jain, Jia Liu, Olivier Teytaud, Benoit Steiner, Yuandong Tian, and Hugh Leather. 2021b. CompilerGym: Robust, performant compiler optimization environments for AI research. arXiv:2109.08267. Retrieved from https://arxiv.org/abs/2109.08267.
[16]
Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira. 2021. AnghaBench: A suite with one million compilable C benchmarks for code-size reduction. In CGO. IEEE, Los Alamitos, CA, 378–390. DOI:
[17]
Junio Cezar Ribeiro Da Silva, Lorena Leão, Vinicius Petrucci, Abdoulaye Gamatié, and Fernando Magno Quintão Pereira. 2021. Mapping computations in heterogeneous multicore systems with statistical regression on program inputs. ACM Trans. Embed. Comput. Syst. 20, 6, Article 112 (Oct.2021), 35 pages. DOI:
[18]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, New York, NY, 4171–4186. DOI:
[19]
Mohamed Essadki, Bertrand Michel, Bruno Maugars, Oleksandr Zinenko, Nicolas Vasilache, and Albert Cohen. 2023. Code generation for in-place stencils. In CGO. Association for Computing Machinery, New York, NY, 2–13. DOI:
[20]
Paul Feautrier. 1991. Dataflow analysis of array and scalar references. Int. J. Parallel Program. 20, 1 (1991), 23–53. DOI:
[21]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv:1512.03385. Retrieved from http://arxiv.org/abs/1512.03385
[22]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. Retrieved from http://arxiv.org/abs/1704.04861
[23]
Charles Jin, Phitchaya Mangpo Phothilimthana, and Sudip Roy. 2022. Neural architecture search using property guided synthesis. Proc. ACM Program. Lang. 6, OOPSLA2, Article 166 (Oct.2022), 30 pages. DOI:
[24]
Samuel J. Kaufman, Phitchaya Mangpo Phothilimthana, Yanqi Zhou, Charith Mendis, Sudip Roy, Amit Sabne, and Mike Burrows. 2021. A learned performance model for tensor processing units. In MLSys, Alex Smola, Alex Dimakis, and Ion Stoica (Eds.). mlsys.org, Indio, CA, 15.
[25]
S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. 1988. Optimization by Simulated Annealing. MIT Press, Cambridge, MA, 551–567.
[26]
Mikhail Lebedev and Pavel Belecky. 2021. A survey of open-source tools for FPGA-based inference of artificial neural networks. In IVMEM. IEEE, New York, NY, 50–56. DOI:
[27]
David A. Levine. 1969. Algorithm 344: Student’s t-distribution [S14]. Commun. ACM 12, 1 (Jan.1969), 37–38. DOI:
[28]
Jintao Meng, Chen Zhuang, Peng Chen, Mohamed Wahib, Bertil Schmidt, Xiao Wang, Haidong Lan, Dou Wu, Minwen Deng, Yanjie Wei, and Shengzhong Feng. 2022. Automatic generation of high-performance convolution kernels on ARM CPUs for deep learning. IEEE Trans. Parallel Distrib. Syst. 33, 11 (2022), 2885–2899. DOI:
[29]
Naums Mogers, Lu Li, Valentin Radu, and Christophe Dubach. 2022. Mapping parallelism in a functional IR through constraint satisfaction: A case study on convolution for mobile GPUs. In CC. Association for Computing Machinery, New York, NY, 218–230. DOI:
[30]
Eliot Moss, Paul Utgoff, John Cavazos, Doina Precup, Darko Stefanovic, Carla Brodley, and David Scheeff. 1997. Learning to schedule straight-line code. In NIPS. MIT Press, Cambridge, MA, 929–935. DOI:
[31]
Auguste Olivry, Guillaume Iooss, Nicolas Tollenaere, Atanas Rountev, P. Sadayappan, and Fabrice Rastello. 2021. IOOpt: Automatic derivation of I/O complexity bounds for affine programs. In PLDI. Association for Computing Machinery, New York, NY, 1187–1202. DOI:
[32]
Auguste Olivry, Julien Langou, Louis-Noël Pouchet, P. Sadayappan, and Fabrice Rastello. 2020. Automated derivation of parametric data movement lower bounds for affine programs. In PLDI. Association for Computing Machinery, New York, NY, 808–822. DOI:
[33]
Phitchaya Mangpo Phothilimthana, Amit Sabne, Nikhil Sarda, Karthik Srinivasa Murthy, Yanqi Zhou, Christof Angermueller, Mike Burrows, Sudip Roy, Ketan Mandke, Rezsa Farahani, Yu Emma Wang, Berkin Ilbeyi, Blake A. Hechtman, Bjarke Roune, Shen Wang, Yuanzhong Xu, and Samuel J. Kaufman. 2021. A flexible approach to autotuning multi-pass machine learning compilers. In PACT, Jaejin Lee and Albert Cohen (Eds.). IEEE, Los Alamitos, CA, 1–16. DOI:
[34]
Lakshminarayanan Renganarayana and Sanjay Rajopadhye. 2008. Positivity, posynomials and tile size selection. In SC. IEEE Press, Los Alamitos, CA.
[35]
Peter Richtárik and Martin Takác. 2012. Parallel Coordinate Descent methods for big data optimization. arXiv:1212.0873. Retrieved from http://arxiv.org/abs/1212.0873
[36]
Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. 2018. Going deeper in spiking neural networks: VGG and residual architectures. arXiv:1802.02627. Retrieved from http://arxiv.org/abs/1802.02627
[37]
Anderson Faustino da Silva, Bernardo N. B. de Lima, and Fernando Magno Quintão Pereira. 2021. Exploring the space of optimization sequences for code-size reduction: Insights and tools. In Compiler Construction. Association for Computing Machinery, New York, NY, 47–58. DOI:
[38]
Douglas Simon, John Cavazos, Christian Wimmer, and Sameer Kulkarni. 2013. Automatic construction of inlining heuristics using machine learning. In CGO. IEEE Computer Society, Los Alamitos, CA, 1–12. DOI:
[39]
Michel Steuwer, Toomas Remmelg, and Christophe Dubach. 2016. Matrix multiplication beyond auto-tuning: Rewrite-based GPU code generation. In CASES. Association for Computing Machinery, New York, NY, Article 15, 10 pages. DOI:
[40]
Rafael Sumitani, Lucas Silva, Frederico Campos, and Fernando Pereira. 2023. A class of programs that admit exact complexity analysis via Newton’s polynomial interpolation. In SBLP. Association for Computing Machinery, New York, NY, 50–55. DOI:
[41]
John Thomson, Michael O’Boyle, Grigori Fursin, and Björn Franke. 2010. Reducing training time in a one-shot machine learning-based compiler. In Languages and Compilers for Parallel Computing, Guang R. Gao, Lori L. Pollock, John Cavazos, and Xiaoming Li (Eds.). Springer, Berlin, 399–407.
[42]
Nicolas Tollenaere, Guillaume Iooss, Stéphane Pouget, Hugo Brunie, Christophe Guillon, Albert Cohen, P. Sadayappan, and Fabrice Rastello. 2023. Autotuning convolutions is easier than you think. ACM Trans. Archit. Code Optim. 20, 2, Article 20 (Mar. 2023), 24 pages. DOI:
[43]
Xiao Wang, Amit Sabne, Sherman Kisner, Anand Raghunathan, Charles Bouman, and Samuel Midkiff. 2016. High performance model based image reconstruction. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’16). Association for Computing Machinery, New York, NY, Article 2, 12 pages. DOI:
[44]
Frank Wilcoxon. 1992. Individual Comparisons by Ranking Methods. Springer New York, New York, NY, 196–202. DOI:
[45]
Michael E. Wolf and Monica S. Lam. 1991. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation (PLDI ’91). Association for Computing Machinery, New York, NY, 30–44. DOI:
[46]
Stephen J. Wright. 2015. Coordinate Descent algorithms. Math. Program. 151, 1 (Jun.2015), 3–34. DOI:
[47]
W. Zangwill. 1969. Nonlinear Programming, A Unified Approach (1st ed.). Prentice Hall, Hoboken NJ.
[48]
Xiaoyang Zhang, Junmin Xiao, and Guangming Tan. 2021. I/O lower bounds for auto-tuning of convolutions in CNNs. In PPoPP. Association for Computing Machinery, New York, NY, 247–261. DOI:
[49]
Jun Zheng, Suhail S. Saquib, Ken D. Sauer, and Charles A. Bouman. 2000. Parallelizable Bayesian tomography algorithms with rapid, guaranteed convergence. IEEE Trans. Image Process. 9, 10 (2000), 1745–1759. DOI:
[50]
Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating high-performance tensor programs for deep learning. In OSDI. USENIX Association, Berkeley, CA, Article 49, 17 pages.
[51]
Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. 2022. ROLLER: Fast and efficient tensor compilation for deep learning. In OSDI, Marcos K. Aguilera and Hakim Weatherspoon (Eds.). USENIX Association, Berkeley, CA, 233–248.
