
The Droplet Search Algorithm for Kernel Scheduling

Published: 21 May 2024

Abstract

Kernel scheduling is the problem of finding the most efficient implementation for a computational kernel. Identifying this implementation involves experimenting with the parameters of compiler optimizations, such as the size of tiling windows and unrolling factors. This article shows that it is possible to organize these parameters as points in a coordinate space. The function that maps these points to the running time of kernels, in general, will not determine a convex surface. However, this article provides empirical evidence that the origin of this surface (an unoptimized kernel) and its global optimum (the fastest kernel) reside on a convex region. We call this hypothesis the “droplet expectation.” Consequently, a search method based on the Coordinate Descent algorithm tends to find the optimal kernel configuration quickly if the hypothesis holds. This approach—called Droplet Search—has been available in Apache TVM since April of 2023. Experimental results with six large deep learning models on various computing devices (ARM, Intel, AMD, and NVIDIA) indicate that Droplet Search is not only as effective as other AutoTVM search techniques but also 2 to 10 times faster. Moreover, models generated by Droplet Search are competitive with those produced by TVM’s AutoScheduler (Ansor), despite the latter using 4 to 5 times more code transformations than AutoTVM.

1 Introduction

In the context of this article, a kernel is a function that reads and writes data indexed by a linear combination of natural numbers. Kernels are typically implemented as nests of affine loops. Examples of kernels include matrix multiplication, transposition, and convolutions. Following Jin et al. [2022], we say that a deep learning model is a function implemented as alternating layers of kernels and non-linear functions (sigmoid, rectified linear unit (ReLU), etc.). Examples of deep-learning models include neural networks, such as BERT [Devlin et al. 2019], ResNet-18 [He et al. 2015], VGG-16 [Sengupta et al. 2018], MobileNet [Howard et al. 2017], and MXNet [Chen et al. 2015].
The Space of Kernel Schedulings. A kernel is an abstract concept that supports different concrete implementations [Jin et al. 2022]. Each implementation, in this article, is called a kernel schedule. The set of every schedule of a kernel is called its search space. The choice of implementation impacts the performance of the kernel. Finding the best schedule for a kernel is an optimization problem whose objective function is running time: The faster a kernel runs, the better that schedule is. The problem of finding exact analytical solutions to kernel scheduling is open, even when fixing the computer architecture [Tollenaere et al. 2023]. Hence, typical techniques are stochastic, take time to converge, and provide no guarantees of optimality [Lebedev and Belecky 2021].
Coordinates and Neighborhoods. Kernel schedules differ due to the application of code transformations, such as tiling, unrolling, and thread blocking. If we fix the sequence of transformations that define the search space, then each schedule is uniquely determined by the transformation parameters: unrolling factor and tiling window per loop, number of threads per block, and so on. These parameters admit a total order: If \(m \lt n\), with \(m, n \in \mathbb {N}\), then an unrolling factor of n is larger than an unrolling factor of m. This ordering determines coordinates on the space of kernel schedules, hence yielding a notion of neighborhood. The neighborhood function relates kernel configurations produced by transformation vectors that differ minimally in a single parameter.
The Key Observation and the Implied Hypothesis. In a convex optimization space, every local minimum is a global minimum. The space of kernel schedules is usually not convex, as Example 3.5 will show. However, we believe that the following expectation applies to the vast majority of machine learning models: It is possible to project the set of all kernel schedules onto a system of coordinates, such that the region between the origin of this space and the fastest kernel configuration forms a convex hypersurface with respect to the running time of the kernels. Hence, the optimum configuration can be reached from the origin by descending along the running time function through a contiguous chain of neighboring kernel configurations. We call this observation the Droplet Expectation and formalize it in Section 3 (see Definition 3.6).
Based on this expectation, this article describes a scheduling technique called Droplet Search, which is currently part of Apache TVM.1 This article evaluates Droplet Search on six architectures (two x86 CPUs, two ARM CPUs, and two NVIDIA GPUs) and on six models (BERT, ResNet-18, VGG-16, MobileNet, MXNet, and Inception-v3). Section 4 shows that Droplet Search runs up to 10 times faster than the other four search algorithms in AutoTVM [Chen et al. 2018] and the search algorithm in TVM’s Ansor [Zheng et al. 2020]. The kernels produced via Droplet Search tend to outperform those produced by the other search approaches in AutoTVM and approximate those produced by Ansor, even though the latter might use up to four times more transformation parameters. These results, explained in Section 4, come from the following contributions:
Intuition: The droplet expectation is not a theorem. In Section 4.4, we show that it is possible to disprove it using analytical cost models involving discontinuous functions. However, the hypothesis is expected to hold in cost models described by continuous functions involving only positive domains and coefficients: a property that Renganarayana and Rajopadhye [2008] call “Positivity.” These models are rather common: Table II in Renganarayana and Rajopadhye’s paper lists eight of them from previous work. More recent models [Olivry et al. 2021, 2020] share similar properties, as Section 4.4 discusses.
Simplicity: The implementation of Droplet Search in AutoTVM 0.14.0 (the pseudocode in Figure 4) consists of 127 lines of commented Python code. For comparison, the implementation of TVM XGB’s search [Chen et al. 2018] uses 971 lines in three files (xgboost_tuner.py, xgboost_cost_model.py, and sa_model_optimizer.py).
Adaptability: Droplet Search works in any setting where AutoTVM does, including settings like a Cortex A7, where Ansor cannot be used to optimize MobileNet [Howard et al. 2017].
Efficiency: Droplet Search converges more than twice as fast as the different algorithms available in AutoTVM (random sampling, grid search, genetic search, and XGBoost) and usually is more than four times faster than TVM’s AutoScheduler.
Effectiveness: In almost every experiment we ran, Droplet Search delivered kernels that either match or outperform those produced by the other schedulers in AutoTVM. Compared to Ansor, in a universe of 30 experiments, Droplet Search found faster kernels in 11 settings and lost in 12. However, Ansor uses up to four times more transformations.

2 The Search Space

A kernel admits an abstract view, formed by an iteration space, a data space, and a computation constrained into these zones.2 As mentioned in Section 1, this view can be implemented in multiple ways as long as the data dependencies encoded in the kernel’s computation are respected. Each implementation differs in how the iteration space is traversed. This scheduling determines how the kernel’s computation updates the data space. Example 2.1 clarifies these notions.
Example 2.1.
Figure 1 shows an abstract view of the matrix multiplication kernel. The computations performed by the kernel can be indexed by triples \((i, j, k)\) , which form its iteration space. The bounds \(R_A\) , \(C_B\) , and \(C_A\) that delimit this space abide by two constraints. The first, \(C_A = R_B\) , is mandatory for correctness; the second— \(R_A\) is even—we use for the sake of the example. Figure 1 also shows five implementations of the abstract kernel. These implementations produce the same matrix C; however, the order in which the computations occur—the kernel schedule—might vary.
Fig. 1.
Fig. 1. (a) The abstract representation of the matrix multiplication kernel. ((b) and (c)) Canonical schedules of the kernel. ((d)–(f)) Schedules that result from applying loop tiling and loop unrolling with different parameters onto the canonical schedule in (c).
Transformation Vectors. As hinted in Example 2.1, in the context of this article, implementations of a kernel differ in the code transformations applied to them. These transformations are guided by parameters. Example 2.2 shows parameters for two well-known transformations: tiling and unrolling.
Example 2.2.
Figure 1(e) shows the kernel that comes from Figure 1(c) after the application of three instances of tiling: a transformation that partitions the iteration space into smaller regions. In this example, tiling happens along the three axes of the iteration space. The dimensions of the tiling window are 8, 32, and 16 points. Each one of these sizes is an optimization parameter. Figure 1(f) shows the kernel produced after an application of unrolling onto the innermost loop of the kernel in Figure 1(c). The unrolling factor, i.e., the transformation parameter, is 2.
The parameters of code transformations can be organized into transformation vectors. A transformation vector is a tuple whose elements represent these parameters. The order in which these parameters appear in the vector determines the order in which transformations are applied to programs. Thus, if a given code transformation has multiple parameters, then these parameters exist sequentially in the transformation vector. Example 2.3 shows how the parameters seen in Example 2.2 can be represented as transformation vectors.
Example 2.3.
Consider the ordered application of the two loop-related compiler optimizations mentioned in Example 2.2 onto the abstract kernel in Figure 1(a): tiling along dimensions A, B, and C and unrolling on dimensions A, B, and C. The vectors representing any application of these code transformations have the format: \(\langle \mathtt {tile}_A:t_A, \mathtt {tile}_B:t_B, \mathtt {tile}_C:t_C, \mathtt {roll}_A:r_A, \mathtt {roll}_B:r_B, \mathtt {roll}_C:r_C \rangle\) . Figure 2 shows the vectors that produce the kernels in Figure 1.
Fig. 2.
Fig. 2. The transformation vectors that produce kernels in Figure 1.
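To make this representation concrete, the sketch below shows one way the transformation vectors of Figure 2 could be encoded. The dictionary-based encoding and the helper make_vector are illustrative assumptions for this article's examples, not the data structures used inside AutoTVM.

```python
# A minimal sketch (not AutoTVM's internal representation) of the
# transformation vectors of Example 2.3: three tiling parameters
# followed by three unrolling parameters. A value of 0 stands for
# "transformation not applied".
from collections import OrderedDict

def make_vector(t_a=0, t_b=0, t_c=0, r_a=0, r_b=0, r_c=0):
    """Build a transformation vector as an ordered map from parameter
    name to parameter value."""
    return OrderedDict([
        ("tile_A", t_a), ("tile_B", t_b), ("tile_C", t_c),
        ("roll_A", r_a), ("roll_B", r_b), ("roll_C", r_c),
    ])

# The null vector yields the canonical schedule of Figure 1(c); the
# second vector describes a schedule tiled with windows of 8, 32, and
# 16 points, like the one in Figure 1(e).
null_vector = make_vector()
tiled_vector = make_vector(t_a=8, t_b=32, t_c=16)
```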
The iteration space of an abstract kernel can be traversed in many ways. If we ascribe one loop per dimension of the iteration space, then any ordering of these loops that respects data dependencies is a valid traversal. The implementation of any such ordering, subject to a null transformation vector (the vector representing the absence of transformations), is called a canonical schedule. A kernel may have more than one canonical schedule, as Example 2.4 illustrates. Canonical schedules lead to Kernel Transformation Spaces, which Definition 2.5 formalizes.
Example 2.4.
Figure 1(b) and (c) show canonical schedules for the abstract kernel in Figure 1(a). Any permutation of the loops in Lines 03–05 yields a valid canonical schedule. Canonical schedules, by definition, are not subject to code transformations: They represent the application of a null transformation vector. As a consequence, they do not show the effects of typical compiler optimizations, such as loop invariant code motion. Thus, the invariant code in Line 06 of Figure 1(b) remains inside the innermost loop.
Definition 2.5 (Kernel Transformation Space).
Let \(\langle \mathcal {P}_1:p_1, \mathcal {P}_2:p_2, \ldots , \mathcal {P}_n:p_n \rangle\) be a transformation vector, such that each parameter \(p_i\) comes from a range of parameters \(\mathcal {P}_i\) . The Cartesian product \(\mathcal {P}_1 \times \mathcal {P}_2 \times \ldots \times \mathcal {P}_n\) plus a canonical kernel \(\mathcal {K}\) form a kernel transformation space. Each point of this space represents a configuration of \(\mathcal {K}\) transformed by an instance of that Cartesian product. \(\mathcal {K}\) is called the origin of this space and the set of transformation parameters is called the basis of this space. If the basis contains n parameters, then the space is called n-dimensional.
Example 2.6.
The vectors in Example 2.3, plus the canonical kernel in Figure 1(c), form a six-dimensional transformation space. The origin of this space is the kernel in Figure 1(c), and its basis is formed by three parameters of loop tiling and three parameters of loop unrolling.
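As a complement to Definition 2.5, the following sketch enumerates a small transformation space as a Cartesian product of parameter ranges. The ranges are assumptions chosen for illustration; real templates in AutoTVM define their own ranges per kernel.

```python
# Enumerating a small, illustrative kernel transformation space as the
# Cartesian product of parameter ranges (Definition 2.5). The ranges
# below are assumptions made for the example, not AutoTVM defaults.
import itertools

parameter_ranges = {
    "tile_A": [0, 8, 16, 32],
    "tile_B": [0, 8, 16, 32],
    "tile_C": [0, 8, 16, 32],
    "roll_A": [0, 2, 4],
    "roll_B": [0, 2, 4],
    "roll_C": [0, 2, 4],
}

names = list(parameter_ranges)
space = [dict(zip(names, values))
         for values in itertools.product(*parameter_ranges.values())]

print(len(space))   # 4 * 4 * 4 * 3 * 3 * 3 = 1,728 configurations
print(space[0])     # the origin: every parameter at its null value
```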
Not every sequence of code transformations is valid. For instance, unrolling a loop by a factor larger than that loop’s trip count is not meaningful. Thus, Definition 2.5 implicitly assumes that the transformation space is produced by valid transformation vectors. Furthermore, the set of kernels determined by Definition 2.5 is non-exhaustive, even under bounded parameters. Non-exhaustiveness follows from two simplifications that we enumerate below, which are assumed in the construction of the transformation space. These simplifications are also adopted in AutoTVM [Chen et al. 2018] and have been incorporated into further work inspired by it [Zhang et al. 2021] as follows:
(1)
Code transformations are applied to the canonical kernel always in a fixed order. In other words, it is possible that by varying the order in which transformations are applied, the space could be extended with new concrete kernels.
(2)
Every point in the transformation space comes directly from the canonical kernel via the application of one transformation vector. Notice that successive applications of optimizations could, in principle, produce kernels outside the transformation space.
The performance of the kernels that form the transformation space might vary. These variations depend on the scheduling, on the target architecture, and on the dimensions of the iteration and data spaces. The problem of finding the best implementation of a kernel, given these constraints, is known as the Kernel Scheduling Problem, a notion extensively discussed in previous work (see Section 5 for references). To keep this article self-contained, Definition 2.7 restates this concept. Search strategies for Kernel Scheduling are key to optimizing deep learning models; thus, the problem in Definition 2.7 has been extensively researched. Example 2.8 revisits some of these techniques.
Definition 2.7 (Kernel Scheduling Problem).
Given a computing device, a transformation space with origin \(\mathcal {K}\) and basis ( \(\mathcal {P}_1, \ldots , \mathcal {P}_n\) ), plus valid inputs for \(\mathcal {K}\) , kernel scheduling asks for the fastest implementation of \(\mathcal {K}\) in the transformation space.
Example 2.8.
Figure 3 provides a pictographic metaphor for different search techniques for the kernel scheduling problem implemented in AutoTVM and contrasts them with the Droplet Search method that this article introduces. Section 3 shall describe the droplet algorithm. Regarding the other methodologies, the search proceeds as follows:
Grid: This explores a bounded region of the transformation space. Regular ranges of transformation parameters determine this region.
Random: Points of the transformation space are sampled randomly. Sampling usually follows a uniform distribution on predefined bounds placed onto the parameters.
Annealing: Simulated annealing [Kirkpatrick et al. 1988] is a refinement of random search, where sampling alternates between regions that are close to and distant from the current best points. Points within a neighborhood are sampled, and, from time to time, the center of this neighborhood changes. The probability of such large jumps happening decreases with time.
Fig. 3.
Fig. 3. Pictographic metaphors for different search algorithms. ((a) and (b)) Grid search and Random sampling: Every configuration within the gray area—and only those—will be evaluated. (c) Simulated Annealing: Solid lines represent moves that go “downhill,” i.e., toward faster kernel configurations; dashed edges represent moves that go “uphill,” i.e., which accept to explore slower configurations to escape from local minima. (d) Droplet Search: The darker the color of a configuration, the faster the running time of that configuration.

3 Droplet Search

Droplet Search is a greedy heuristic that explores the kernel transformation space. The proposed technique is a variation of the Coordinate Descent Algorithm,3 with extensions proposed by Richtárik and Takác [2012] to enable synchronous parallelism. Figure 4 provides an overview of the search algorithm evaluated in this article, and Figure 3(d) provides a graphical metaphor to illustrate how the algorithm works: At each step—up to a maximum fixed number of iterations—the search chooses the most profitable kernel configuration that is considered a “neighbor” of the current best configuration. The rest of this section discusses each part of the algorithm shown in Figure 4.
Fig. 4.
Fig. 4. Droplet Search, a Coordinate Descent variation proposed in this article to solve the Scheduling Problem from Definition 2.7.
The successful application of Coordinate Descent depends on a suitable “neighborhood function,” the subject of Section 3.1. The search tends to converge to a global optimum, based on the intuition that Section 3.2 provides. Convergence is based on criteria discussed in Section 3.3. The section closes with a discussion of two optimizations that speed up the search: parallelization and speculation (Section 3.4).
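For readers who prefer code to pseudocode, the sketch below condenses the search loop of Figure 4 into a few lines. It is not the 127-line AutoTVM implementation: neighborhood and measure are placeholders for the neighborhood function of Section 3.1 and for the timed execution of a kernel configuration, and the statistical test of Section 3.3 as well as the parallel, speculative evaluation of Section 3.4 are omitted.

```python
# A condensed sketch of the Droplet Search loop of Figure 4 (not the
# actual AutoTVM code). `neighborhood(v)` enumerates neighbors as in
# Definition 3.2; `measure(v)` compiles and times configuration v.
def droplet_search(origin, neighborhood, measure, max_iters=10_000):
    best, best_time = origin, measure(origin)
    visited = {tuple(origin.items())}
    for _ in range(max_iters):
        candidates = [v for v in neighborhood(best)
                      if tuple(v.items()) not in visited]
        if not candidates:
            break
        # In the real implementation these measurements run in parallel
        # (Section 3.4) and comparisons use a t-test (Section 3.3).
        timed = [(measure(v), v) for v in candidates]
        visited.update(tuple(v.items()) for v in candidates)
        time, vector = min(timed, key=lambda pair: pair[0])
        if time >= best_time:
            break            # no neighbor is faster: a local minimum
        best, best_time = vector, time
    return best, best_time
```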

3.1 The Neighborhood Function

Any code transformation used in this article is parameterized by one positive integer. Example 2.2 describes some of these parameters. This assumption—transformations defined by one positive integer—forces a total ordering over different instances of a code transformation. The domain of a code transformation is the set of all the values of its parameters. This article works only with numeric domains; however, a domain does not need to be contiguous, as Example 3.1 shows.
Example 3.1.
The list below shows different compiler optimizations and the ordering between their instances. The example takes liberties for the sake of the illustration; i.e., vector lengths are architecture dependent, and not every size might be available, as follows:
Peeling, with parameter \(\mathtt {peel}=n, n \in [0, 1, 2, \dots , \mathtt {maxp}]\) being the number of iterations to peel. The set of values \(\lbrace 0, 1, \ldots , \mathtt {maxp}\rbrace\) is the domain of the transformation.
Unrolling, with parameter \(\mathtt {roll}=n, n \in [0, 1, 2, \dots , \mathtt {maxr}]\) being the unrolling factor.
Thread blocking, (in graphics-processing units) with parameter \(\mathtt {thread}=n, n \in [0, 1, 2, \dots ,\) \(\mathtt {maxt}]\) being the number of GPU threads per block.
Tiling, with parameter \(\mathtt {tile}=n, n \in [N/20, N/15, N/12, N/10, N/6, N/4, N/3, N/2]\) , where N is the size of the iteration space. We assume that N is a multiple of 60. Notice that this example chooses perfect divisors of the iteration space, but different ranges are possible.
Vectorization, with parameter \(\mathtt {vect}=n, n \in [0, 2, 4, 8, 16]\) being the vector length.
In any of the above examples, we say that an instance of a transformation is less than another instance based on the ordering between their parameters. For example, if we consider vectorization ( \(\mathtt {vect}\) ), then \(\mathtt {vect}=m \lt \mathtt {vect}=n\) if, and only if, \(m \lt n\) .
The order between parameters leads to a notion of neighborhood, formalized as follows:
Definition 3.2 (Neighborhood).
Consider two transformation vectors of an n-dimensional space: \(v_1 = \langle \mathcal {P}_1:p_1, \ldots , \mathcal {P}_{i-1}:p_{i-1}, \mathcal {P}_i:x, \mathcal {P}_{i+1}:p_{i+1}, \ldots , \mathcal {P}_n:p_n \rangle\) ; and \(v_2 = \langle \mathcal {P}_1:p_1, \ldots , \mathcal {P}_{i-1}:p_{i-1}, \mathcal {P}_i:y, \mathcal {P}_{i+1}:p_{i+1}, \ldots , \mathcal {P}_n:p_n \rangle\) , such that each p is a transformation parameter within range \(\mathcal {P}\) . Vectors \(v_1\) and \(v_2\) are neighbors along dimension \(i, 1 \le i \le n\) , if they differ only on dimension i and if
(1)
\(x \lt y\) (without loss of generality);
(2)
\(\forall z \in \mathcal {P}_i, z \ne x\) , if \(z \lt y\) , then \(z \lt x\) .
(3)
\(\forall z \in \mathcal {P}_i, z \ne y\) , if \(z \gt x\) , then \(z \gt y\) .
If two vectors are neighbors along one dimension, then they are called neighbors. The set of every neighbor of a given transformation vector is called the neighborhood of that vector.
Each dimension of a transformation vector v is considered separately when determining the neighbors of v. There exist at most two neighbors per transformation parameter; therefore, the number of neighbors of v grows linearly with the number of its dimensions. This observation is essential for scalability: If the neighborhood of a vector considered variations in two or more dimensions, then the size of the neighborhood would be exponential on the number of dimensions.
Example 3.3.
Consider the following transformation vector: \(v = \langle \mathtt {tile}_A: 0, \mathtt {tile}_B:8, \mathtt {tile}_C:16, \mathtt {roll}_A:4, \mathtt {roll}_B:0, \mathtt {roll}_C:0\rangle\) , which applies onto the canonical kernel in Figure 1(c). If we let \(\mathtt {roll}_A = [2, 4, 8, 16, 24]\) , then v has two neighbors along dimension \(\mathtt {roll}_A\) . These neighbors are \(v_1 = \langle \ldots , \mathtt {roll}_A:2, \ldots \rangle\) and \(v_2 = \langle \ldots , \mathtt {roll}_A:8, \ldots \rangle\) .
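The neighborhood of Definition 3.2 admits a direct implementation once each parameter domain is kept as a sorted list. The sketch below reproduces Example 3.3 under the dictionary-based vector encoding assumed in the earlier sketches.

```python
# A minimal sketch of the neighborhood of Definition 3.2: a neighbor
# changes a single parameter to the value immediately below or above
# the current one within that parameter's (sorted) domain.
def neighborhood(vector, domains):
    neighbors = []
    for name, value in vector.items():
        domain = domains[name]
        i = domain.index(value)
        for j in (i - 1, i + 1):   # at most two neighbors per dimension
            if 0 <= j < len(domain):
                neighbor = dict(vector)
                neighbor[name] = domain[j]
                neighbors.append(neighbor)
    return neighbors

# Example 3.3, restricted to the roll_A dimension:
domains = {"roll_A": [2, 4, 8, 16, 24]}
print(neighborhood({"roll_A": 4}, domains))
# [{'roll_A': 2}, {'roll_A': 8}]
```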

3.2 Convexity

The goal of this article is to find implementations for abstract kernels that minimize their running times. To this purpose, we define the running time function \(\mathit {RT}\) as follows:
Definition 3.4 (The Running Time Function).
Given (i) a canonical Kernel \(\mathcal {K}\) , (ii) a transformation vector v with parameters \(\mathcal {P}\) , (iii) a computer architecture A, and (iv) input data I for the canonical kernel, we define the running time function \(\mathit {RT}_{A,I,\mathcal {K}}(v) : \mathcal {P} \mapsto \mathbb {R}\) as a function that maps the implementation of \(\mathcal {K}\) optimized by v to the time it takes to process input I on target A.
In practice, \(\mathit {RT}\) is obtained by running the kernel configurations. The notion of neighborhood makes the transformation space a coordinate space: It is possible to define the distance between two transformation vectors. The running time function \(\mathit {RT}\) divides this coordinate space into two halves: If \(\mathit {RT}_{A,I,\mathcal {K}}(v) = t\) , then we have a region formed by \(t_{\mathit {lower}} \lt t\) and a region formed by \(t_{\mathit {higher}} \ge t\) . Thus, \(\mathit {RT}_{A,I,\mathcal {K}}\) determines a hypersurface over the transformation space. This hypersurface is convex if its local minima are equal to its global minimum. We say that vector v is a local minimum of \(\mathit {RT}_{A,I,\mathcal {K}}\) if \(\mathit {RT}_{A,I,\mathcal {K}}(v) \lt \mathit {RT}_{A,I,\mathcal {K}}(v^{\prime })\) for any \(v^{\prime }\) within the neighborhood of v. A global minimum is the smallest local minimum in a set.
Example 3.5.
Figure 5 shows hypersurfaces produced by two running time functions. Each function is parameterized by a different architecture and different inputs (the dimensions of the iteration space). The functions are parameterized by the same canonical kernel—the configuration in Figure 1(b). Two parameters form the transformation space: \(\mathtt {tile}_A \in [0, 20, 40, \ldots , 120]\) , and \(\mathtt {tile}_B \in [0, 20, 40, \ldots , 120]\) . The figure shows, for each running time function, its optimal configuration. The figure also shows at least one local minimum that differs from the optimum. These hypersurfaces are not convex, because they contain local minima that are not globally optimal.
Fig. 5.
Fig. 5. Performance variation for the kernel in Figure 1(b), considering the following parameters of the iteration space: (left) \(A = 1,000, B = 800, C = 700\) and (right) \(A = 3,500, B = 1,400, C = 2,400\) . White pins show optimal tiling configurations, and gray pins show points of local minima that are not globally optimal.
The hypersurfaces in Figure 5 are not convex. However, they have the following property: The origin and the global optimum belong to the same convex region. This property remains true, at least for the two settings in Figure 5, if we add more dimensions to the transformation space considered in Example 3.5, such as an extra tiling window, unrolling of the innermost loop, or interchange of any pair of loops. Definition 3.6 states this observation as our working hypothesis.
Definition 3.6 (The Droplet Expectation).
Let \(v_{\mathit {opt}}\) be the optimum kernel of the running time function \(\mathit {RT}_{A,I,\mathcal {K}}\) . We expect \(v_{\mathit {opt}}\) and \(\mathcal {K}\) , the unoptimized kernel, to belong to a convex, contiguous subset of the hypersurface determined by \(\mathit {RT}_{A,I,\mathcal {K}}\) . Hence, we expect the existence of a chain of neighbor vectors \(v_0, v_1, v_2, \ldots , v_{n-1}, v_n\) , where \(\mathcal {K} = v_0\) and \(v_n = v_{\mathit {opt}}\) , with the following properties:
Contiguous chain: \(v_i \in \mathit {neighborhood}(v_{i-1}), 0 \lt i \le n\) ; and
Descending chain: \(\mathit {RT}(v_i) \lt \mathit {RT}(v_{i-1}), 0 \lt i \le n\) .
A continuous hypersurface can be partitioned into convex regions: maximal convex sets formed by the transitive closure of the neighborhood function. Whenever the Droplet Expectation is confirmed, the origin and the global optimum belong to the same convex region. Thus, there exists a “downhill path” from origin to optimum that can be found by Coordinate Descent (assuming an idealized running time function without statistical variations). If the Droplet Expectation fails, then the origin and the global optimum belong to distinct convex regions. Figure 6 illustrates these ideas.
Fig. 6.
Fig. 6. (a) Downhill path from origin to global optimum on a three-dimensional hypersurface. (b) Expectation holds: Origin and global optimum belong to the same convex region. (c) Expectation fails: Origin and global optimum belong to different convex regions. Configuration \(v_3\) is a local minimum but not a global optimum.
Intuition. The hypothesis stated in Definition 3.6 is not a certainty: It is possible to disprove it with analytical models involving discontinuous functions, as Section 4.4 shows. However, Section 4 demonstrates that the hypothesis holds on a variety of architectures and models. Indeed, the effect of many compiler optimizations can be described by polynomial equations involving only positive coefficients ranging over positive domains. The second derivative of such functions, if it exists, will always be positive. This condition is sufficient to ensure convexity [Bertsekas 2009, Prop. 1.1.10]. In the words of Renganarayana and Rajopadhye [2008], “the use of polynomial functions with this property leads to convex optimization problems which can be solved for real solutions in polynomial time.” Renganarayana and Rajopadhye call this property Positivity. Notice that the ordering of the various optimization levels along each dimension of the search space is primarily responsible for making the droplet expectation hold. For instance, going back to Example 3.5, the expansion of the tiling window might be beneficial for performance until the number of elements in this window exceeds the capacity of the L1 cache. After this point, any further increase will cause cache misses.
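The toy model below illustrates this intuition numerically. The cost function is invented for the example (it is not one of the models cited above), but it has the Positivity property: positive coefficients over a positive domain, with reload traffic decreasing and per-tile overhead increasing as the tile grows. On such a surface, coordinate descent starting from the smallest tile reaches the same configuration as an exhaustive sweep.

```python
# A toy cost model with the Positivity property (invented for this
# illustration): reload traffic shrinks as the tile grows, while the
# per-tile overhead grows with the tile's perimeter.
def cost(h, w):
    return 262144.0 / (h * w) + h + w

domain = list(range(8, 257, 8))          # candidate tile sizes 8, 16, ..., 256

def descend(h, w):
    """Coordinate descent over the (h, w) grid, starting at (h, w)."""
    best = (h, w)
    while True:
        i, j = domain.index(best[0]), domain.index(best[1])
        neighbors = [(domain[a], domain[b])
                     for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                     if 0 <= a < len(domain) and 0 <= b < len(domain)]
        nxt = min(neighbors, key=lambda p: cost(*p))
        if cost(*nxt) >= cost(*best):
            return best
        best = nxt

print(descend(8, 8))                      # descent from the "origin": (64, 64)
print(min(((h, w) for h in domain for w in domain),
          key=lambda p: cost(*p)))        # exhaustive optimum:        (64, 64)
```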

3.3 Stop Criterion

The existence of the convex path expected in Definition 3.6 does not ensure that such a path can be discovered via a Coordinate Descent search procedure. Running time has a stochastic nature: This nature implies that the actual evaluation of \(\mathit {RT}_{A,I,\mathcal {K}}\) on a kernel produced by a transformation vector v is prone to variations. Thus, Coordinate Descent might reach suboptimal configurations that are apparently optimal due to measurement fluctuations on \(\mathit {RT}_{A,I,\mathcal {K}}(v)\) .
To increase the reliability of Coordinate Descent, multiple evaluations of each point of the transformation space are in order. However, each further evaluation contributes to increasing the total time necessary to solve scheduling. The implementation of Droplet Search that we analyze in Section 4 performs three evaluations of each point visited in the process of solving scheduling. Three evaluations are the default sampling procedure adopted by AutoTVM. We use Student’s t-test (following Levine [1969]’s implementation) with significance level \(\alpha = 0.05\) to compare the two populations of three samples each.4 In other words, the search stops if we reach a transformation vector v with no neighbor that yields a faster kernel configuration with a confidence level of 95%.
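The comparison performed by this stop criterion can be sketched as below. The sketch uses SciPy's two-sample t-test in place of the Levine-style implementation used in the actual code, and the sample values are made up for illustration.

```python
# A sketch of the stop criterion's comparison: two populations of three
# timing samples each, compared with Student's t-test at alpha = 0.05.
# SciPy's ttest_ind stands in for the Levine-style implementation used
# in the article's actual code.
from scipy import stats

def is_significantly_faster(candidate_times, best_times, alpha=0.05):
    """True if `candidate_times` is faster than `best_times` with the
    required confidence level."""
    result = stats.ttest_ind(candidate_times, best_times)
    faster = sum(candidate_times) < sum(best_times)
    return faster and result.pvalue < alpha

best = [1.02, 1.05, 1.03]        # milliseconds, three samples (illustrative)
candidate = [0.91, 0.93, 0.92]
print(is_significantly_faster(candidate, best))   # True: keep descending
```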

3.4 Synchronous Parallelism and Speculation

The algorithm in Figure 4 keeps a current candidate transformation vector, which is initialized with the origin of the optimization space in Line 02 and is updated in Line 13 whenever a faster kernel configuration is found. To find a faster configuration, every point in the neighborhood of the current candidate is considered. The evaluation of these points happens in parallel, as Lines 09 and 10 of Figure 4 show. However, we have observed that it is often difficult to fill up every available thread with unvisited kernel configurations. Example 3.7 explains how this difficulty emerges.
Example 3.7.
Figure 7 shows how parallel evaluation of kernel configurations happens. Points in the neighborhood of the current candidate are evaluated in synchronous batches. Figure 7 represents these points as gray boxes. Usually, there will be a non-empty intersection of configurations between the neighborhood of the current candidate and the next candidate. Intersecting points will be stored in the visited set seen in Figure 4. In this example, intersecting points are marked with gray numbers in Figure 7(b) and (c). Overlaps reduce parallelism: Only five points are evaluated concurrently; thus, resources will be underutilized in processors with more than five cores.
Fig. 7.
Fig. 7. ((a)–(c)) Progress of Droplet Search without speculation. Each number shows the order in which points are evaluated. Gray boxes denote points currently being evaluated. Gray numbers show points in the neighborhood of the current candidate that were already evaluated. (d) Differential speculation: a subset of the extended neighborhood of the current candidate is evaluated to maximize thread occupancy.
Differential Speculation. We resort to speculation to maximize thread occupancy. Speculation, in the context of this article, is the evaluation of kernel configurations in the extended neighborhood of the current candidate. The extended neighborhood is formed by the neighbors of neighbors. The algorithm in Figure 4 determines these extra points—the Speculative Set—in Line 07. This set is built via differential speculation; the history of previous coordinates of the best candidates determines a speculative set as follows: Let \(v_{n-1}\) be the candidate at iteration \(n-1\) of Coordinate Descent, and let \(v_n\) be the candidate at iteration n. By Definition 3.2, \(v_{n-1}\) and \(v_n\) differ in only one dimension, e.g., \(v_{n-1} = \langle \ldots , \mathcal {P}: x_{n-1}, \ldots \rangle\) and \(v_n = \langle \ldots , \mathcal {P}: x_n, \ldots \rangle\) . We choose a new transformation vector \(v_s = \langle \ldots , \mathcal {P}: x_n + s, \ldots \rangle\) , where \(x_n + s\) is the value that follows \(x_n\) within the parameter \(\mathcal {P}\) . By the nature of Coordinate Descent, it is likely that \(v_s\) is not in the visited set. Nevertheless, if the centroid of the speculative set is in the visited set, then Line 07 of Figure 4 still ensures that multiple evaluations will not happen. Given \(v_s\) , we let the speculative set be its neighborhood. Line 07 of Figure 4 adds this set to the batch of configurations waiting to be evaluated.
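The sketch below shows one way the centroid of the speculative set could be derived from the last two candidates, assuming the dictionary-based vectors and sorted domains of the earlier sketches; it illustrates the idea, and is not the AutoTVM code of Figure 4.

```python
# A sketch of differential speculation: the last two candidates differ
# in exactly one dimension; the speculative centroid extends that move
# one more step along the same dimension. Its neighborhood is then
# appended to the batch of configurations waiting to be evaluated.
def speculative_centroid(prev, curr, domains):
    changed = [name for name in curr if curr[name] != prev[name]]
    if len(changed) != 1:
        return None                     # at the origin: nothing to extrapolate
    name = changed[0]
    domain = domains[name]
    i, k = domain.index(curr[name]), domain.index(prev[name])
    j = i + (1 if i > k else -1)        # one more step in the same direction
    if not 0 <= j < len(domain):
        return None                     # the move would leave the domain
    centroid = dict(curr)
    centroid[name] = domain[j]
    return centroid
```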
Example 3.8.
Figure 7(d) provides some idea of how speculation works. We let S be the centroid of a speculative set. This kernel configuration, S, is chosen by extending the overall direction of the Coordinate Descent path speculatively: an action represented by the longer arrow in Figure 7(d). In this example, speculation increases the number of active threads from 5 to 10.
Even though we restrict speculation to the extended neighborhood of the current candidate, this technique can still cause the line search of Coordinate Descent to leave a convex region. However, Section 4.5 shows that such events do not happen in practice. Had we adopted larger speculation steps, i.e., outside the extended neighborhood of the current candidate, the risk of leaving the convex region would be higher.

4 Evaluation

This section compares Droplet Search with similar techniques employed in the implementation of Apache TVM. To this effect, this section investigates the following research questions:
RQ1: Can we demonstrate the Droplet Expectation in different architectures, at least for transformation vectors involving a small number of dimensions?
RQ2: How effective is Droplet Search on end-to-end models compared to other search techniques on different computer architectures?
RQ3: What is the average number of samples that Droplet Search takes to converge to optimal results, compared to other search techniques?
RQ4: Does the Droplet Search converge to a global optimum when applied to an industrial-quality analytical cost model?
RQ5: How does the number of threads used in Droplet Search influence the convergence time of the algorithm and the quality of the model that it finds, with and without speculation?
RQ6: How do kernels tuned via AutoTVM and Ansor compare to hand-written implementations of kernels in TensorFlow?
RQ7: How does the behavior of Droplet Search vary, in terms of quality and speed, on each individual kernel that constitutes a machine-learning model?
RQ8: What is the impact of the confidence level on the convergence rate of Droplet Search in terms of search speed and quality of the final model?
Hardware. This section evaluates scheduling approaches on six computer architectures, which Figure 8 enumerates. This mix contains two general-purpose desktop architectures (AMD and Intel), two embedded system-on-chips (ARM), and two graphics processing units (NVIDIA).
Fig. 8.
Fig. 8. The architectures evaluated in this article.
Software. This section uses Apache TVM v0.11.1, released in March 2023. We implemented Droplet Search as part of AutoTVM. AutoTVM also provides four other search techniques: grid, random, genetic (GA), and simulated annealing (XGB). XGB is described by Chen et al. [2018]. This technique uses a cost model to guide simulated annealing. The constants that define this cost model are learned during a training phase. We also compare our implementation of Droplet Search with AutoScheduler (Ansor) [Zheng et al. 2020], also available in Apache TVM v0.11.1.
Benchmarks. This section evaluates kernel scheduling on five convolutional neural networks and on one encoder stack of transformers (BERT). Figure 9 enumerates the models. The schedulers in AutoTVM, including Droplet Search, use the same optimization parameters. We have not chosen these parameters: They are pre-defined (per model) in the distribution of AutoTVM. Optimization parameters differ depending on the computer architecture. In the CPUs, these parameters refer to two optimizations: tiling and unrolling. In the GPUs, they concern two more: thread blocking (the ability to partition CUDA threads into blocks) and shared memory tiling (the ability to bring data from global to shared memory). All the parameters are divisors of boundaries of the iteration space. For instance, the size of a tile window must be a divisor of the size of the iteration space along the tiled dimension. This fact removes discontinuities from the search space (code that the compiler inserts to handle boundary conditions), as Section 4.4 will explain.
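The restriction to divisors can be sketched as follows; the helper below is an illustration of the idea, not TVM's API.

```python
# A sketch (not TVM's API) of tile-size candidates restricted to perfect
# divisors of a loop extent. This restriction avoids the epilogue code,
# and the resulting discontinuities, discussed in Section 4.4.
def divisor_tiles(extent):
    """All tile sizes that evenly divide the loop extent."""
    return [d for d in range(1, extent + 1) if extent % d == 0]

print(divisor_tiles(224))   # e.g., a 224-wide convolution dimension:
# [1, 2, 4, 7, 8, 14, 16, 28, 32, 56, 112, 224]
```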
Fig. 9.
Fig. 9. The machine-learning models evaluated in this article. “S” is the number of possible configurations formed by a given number “P” of transformation parameters. The largest neighborhood explored by Droplet Search at any time contains \(2\times \mbox{P} + 1\) points. “K” is the number of kernels that form each model. The implementation of the models varies according to the architecture. TVM’s implementation of MXNet does not run on the ARM boards.
Ansor uses more optimizations than AutoTVM. Whereas AutoTVM is restricted to a single transformation vector format (in TVM’s parlance, a kernel template), Ansor creates several of them. The extra templates effectively give Ansor access to inter-kernel optimizations, such as loop fusion and fission: It can merge operators (i.e., kernels) in the computational graph, for instance. Figure 9 shows the number of parameters in the largest template that Ansor explores for each network. The optimizers used in AutoTVM—Droplet Search included—are restricted to intra-kernel optimizations. Nevertheless, we hope to demonstrate that, even with access to a smaller pool of optimizations, Droplet Search can be competitive with Ansor, which resorts to more extensive code transformations.

4.1 RQ1: The Droplet Expectation

Definition 3.6 specifies a behavior that is likely to characterize typical implementations of linear kernels. This expectation is not a guarantee; thus, while we anticipate finding a path from origin to optimum along which performance improves gradually, such a path might not exist. In this case, Droplet Search will be stuck on a local minimum. Nevertheless, this section provides some empirical evidence that the Droplet Expectation holds. Section 4.2 provides further evidence: The best kernel configurations found by Coordinate Descent are similar to the best configurations found via more extensive techniques, even when speculation is enabled.
Methodology. We have analyzed the behavior of two kernels: matrix multiplication and convolution on six architectures. In this experiment, we use two-dimensional spaces for the sake of visualization. On CPUs (ARM or AMD), the transformation space is formed by the tiling of the two innermost loops of each kernel. Tile sizes, in both dimensions, are \(0, 8,16,24, \ldots ,128\) ; hence, we have a \(17 \times 17\) transformation space. On GPUs, the transformation space is formed by the number of threads in two-dimensional thread blocks. The possible number of threads are \(1, 2, 4, \ldots , 32\) , also forming a \(17 \times 17\) transformation space. Convolution uses a \(1{,}024 \times 1{,}024\) matrix with a \(3 \times 3\) filter. Multiplication uses the following matrix sizes: \(1{,}000 \times 700\) , \(700 \times 800\) , and \(1{,}000 \times 800\) .
Fig. 10.
Fig. 10. Visual representation of nine different \(17 \times 17\) transformation spaces, showing the path traversed by Coordinate Descent from origin to global optimum. Matrix multiplication is mm2d, and convolution is conv3. Performance improves from green to red. Black cells denote invalid configurations. We show, on top of each grid, the nesting order of loops, where the canonical configuration starts with the nested sequence ijk.
Discussion. The Droplet Expectation holds in every one of these \(2 \times 6\) scenarios. Figure 10 shows representations for nine of these spaces, highlighting the path taken by Coordinate Descent. Some configurations involving thread blocking on the GPU are not valid; thus, they are not evaluated. In every scenario, the optimal kernel configuration is close to the origin, as the paths in Figure 10 emphasize. The Droplet Expectation continues to hold if we increase the number of dimensions in the transformation space or the number of kernels in the model. Section 4.2 discusses this experiment.

4.2 RQ2: End-to-end Effectiveness

The goal of any kernel scheduling technique is to find efficient implementations of computational graphs involving kernels. This section investigates how Droplet Search fares in such a task, compared with other scheduling techniques, when optimizing well-known machine learning models.
Methodology. We evaluate six end-to-end models on six architectures, with a hard limit of 10,000 evaluations per search technique. This arbitrary limit prevents experiments from running for too long. The Droplet Search will stop after evaluating 10,000 kernel configurations; however, it tends to stabilize earlier, as discussed in Section 4.8. The other search techniques do not have a notion of premature convergence; hence, they evaluate 10,000 configurations. Notice that Figure 9 shows that every transformation space contains more than 10,000 configurations. All the search algorithms evaluate kernel configurations in parallel. The parallelism in Droplet Search follows the ideas from Section 3.4. We use a confidence interval of 95% when comparing a candidate kernel configuration with kernels within its neighborhood. Section 4.3 discusses the impact of this last choice.
Fig. 11.
Fig. 11. Comparison of kernel scheduling techniques. “Model (ms)” is the running time of the best schedule for a given model. “Search (min)” is the time to find that configuration. Light-gray boxes denote statistically similar results (with a confidence level of 95%). Black boxes denote statistically significant best kernel times. Gray boxes with white fonts denote statistically significant best search times. Double borders denote the best schedule produced by an algorithm from AutoTVM only (i.e., Ansor is not considered).
Discussion. Figure 11 compares the effectiveness of different search techniques. Considering only the search techniques in AutoTVM, Droplet Search generally yields the best kernels or ties for the best (usually with XGB). Its search, however, is faster. On the AMD CPU, for instance, Droplet Search optimizes VGG-16 in 10% of the time the other scheduling approaches require. Ansor tends to produce faster kernels than the techniques implemented in AutoTVM, including Droplet Search. Ansor explores more optimization parameters: It has access to a large number of kernel templates, whereas AutoTVM uses only one. Nevertheless, due to excessive memory consumption, Ansor (and also AutoTVM’s XGB) could not be used to schedule our largest models on the ARM boards. We have observed that Droplet Search does not perform well on the GPUs. In this case, the size of the search space (from 200M to 2.7T configurations, as seen in Figure 9) forces Coordinate Descent to stop at the iteration limit in every model except ResNet-18. Incidentally, on ResNet-18 Droplet Search produces faster kernels than Ansor on every GPU.

4.3 RQ3: Stop Criteria

As Section 3.3 explains, our current implementation of Droplet Search stops once it reaches a candidate kernel configuration faster than all its neighbors. Statistical significance of runtime differences is determined via Student’s t-test applied over two populations consisting of three samples each, with a confidence level of 95%. Yet, measurements might fluctuate, and premature termination is possible. If we tighten the confidence level, then termination might happen too early. If we loosen it, then convergence can take too long, and Coordinate Descent might visit configurations that are not statistically significantly faster. This section investigates how Droplet Search fares once we vary the confidence level for comparing kernel configurations.
Methodology. We evaluate the five deep-learning models listed in Figure 9 on the Intel i7 using five different levels of confidence: 99%, 95%, 90%, 75%, and no test. In the latter case, we use the absolute arithmetic average of three samples to determine which kernel is faster.
Fig. 12.
Fig. 12. Effect of confidence level (the stop criterion of Section 3.3) on the quality of the best kernel configuration and on the running time of Droplet Search.
Discussion. We observed almost no variation in the quality of the best kernel configuration depending on the confidence level. This result seems to indicate, at least for the five models running on the Intel i7 CPU, that there exists a number of “acceptable best” kernels with very similar dynamic behavior. However, the search time increases—albeit slightly—once we move from high confidence levels toward low confidence levels. This growth in search time happens because more kernel configurations tend to be visited by the search procedure.

4.4 RQ4: Analytical Models

Analytical cost models are systems of equations that predict the cost of a program (CPU cycles, I/O operations, cache misses, etc.) given a model of the hardware. Recently, different research groups have designed analytical cost models to estimate the performance of machine-learning kernels [Olivry et al. 2021; 2020, Sumitani et al. 2023, Tollenaere et al. 2023, Zhang et al. 2021]. This section investigates if the droplet expectation also holds in some of these models.
Methodology. This section evaluates cost models taken from three different sources. The first models were proposed by Olivry et al. [2021]. They estimate a lower bound for the amount of data movement between slow and fast memories. The second model is part of the Xtensa Neural Network Compiler (XNNC), from Cadence Tensilica, and was made available to us through the Cadence Academic Network.5 This model estimates the number of execution cycles that a program compiled by XNNC takes to execute on Cadence DSPs. Finally, we analyze the eight analytical models listed in Table II of Renganarayana and Rajopadhye [2008].

4.4.1 Olivry et al.’s Cost Models.

We evaluate Olivry et al.’s model on the tiled version of matrix–matrix multiplication from the authors’ original work (Listing 1 [Olivry et al. 2021]). This program is the default example in Olivry et al.’s online tool.6 We evaluate it in a system with two caches, of 64 and 256 KB—the sizes of the first two cache levels of the Intel i7.
Discussion. Figure 13 shows hypersurfaces produced by the cost models generated via Olivry et al.’s online tool. Each figure relates the size of a bidimensional tiling window to the I/O cost in terms of memory transfers. The lower the cost, the faster the program is expected to run. Figure 13(a) assumes one level of cache (with 64 KB). The other two figures assume two levels (64 and 256 KB). Figure 13(b) and (c) are similar to the surfaces seen in Figure 5, which explore the same program, albeit on an actual machine. In the three figures, the droplet expectation (Definition 3.6) holds. This result is not a coincidence: Olivry et al.’s cost models involve only positive quantities (domains and coefficients) and, thus, form convex surfaces.
Fig. 13.
Fig. 13. Amount of memory transfers produced by Olivry et al. [2021]’s cost model applied onto Listing 1 in Olivry et al.’s work: a tiled version of the kernel in Figure 1(b) in this article. We vary tiling dimensions Ti_0, Ti_1, and Tk_0. The input matrices have sides Ni = Nj = Nk = 10K.

4.4.2 Tensilica’s Cost Models.

We evaluate the Tensilica model on an implementation of the tiled ReLU kernel on two digital signal processors, called P1 and P6. We chose these two processors because they are used as tests in the Tensilica tool. The DSPs do not have a cache; however, they contain local memory and system memory. Hence, the compiler must implement direct memory access (DMA) transfers from system memory to local memory, and tiling reduces the number of DMA operations.
Discussion. In contrast to Olivry et al.’s cost model, the equations produced by XNNC take into consideration the fact that the tiling window might not be a perfect divisor of the loop’s iteration space. If tiling does not perfectly divide the iteration space, then the XNNC compiler generates epilogue code to fetch data outside the tiled loop. Consequently, the performance model contains conditionals. For instance, Figure 14(a) was produced by a (simplified) equation like \(\mathtt {cost} = C_1 + (\mathtt {if} \ W\%63 \ \mathtt {then} \ C_2 \ \mathtt {else} \ 0) + C_3 \times (H/64)\) . The coefficients \(C_i\) represent costs of particular instructions; W and H are tile sizes. Due to these conditionals, the cost models are represented by discontinuous functions. In this case, the droplet expectation does not hold. However, if we restrict valid neighborhoods to only perfect divisors of the iteration space, then the Droplet Expectation holds. For instance, starting with tiling windows of size 16 or greater, coordinate descent achieves the optimal configuration in any of the surfaces seen in Figure 14 after three iterations. Notice that this restriction is unnecessary if some loss is acceptable. When applied to large deep-learning networks—under the same XNNC analytical model—Droplet Search stays very close to the global optimum, while sampling less than 1% of the space covered by an exhaustive grid search.
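To make the discontinuity visible, the sketch below evaluates the simplified equation quoted above with made-up coefficients; it is not the actual XNNC model.

```python
# The simplified XNNC-style cost quoted above, with made-up coefficients,
# showing how the divisibility conditional introduces a discontinuity:
# cost = C1 + (C2 if W % 63 != 0 else 0) + C3 * (H / 64).
def xnnc_cost(w_tile, h_tile, c1=100.0, c2=500.0, c3=10.0):
    epilogue = c2 if w_tile % 63 else 0.0   # extra code for imperfect tiles
    return c1 + epilogue + c3 * (h_tile / 64)

print(xnnc_cost(63, 64))   # a perfect divisor: no epilogue penalty
print(xnnc_cost(64, 64))   # off by one: the cost jumps by C2
```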
Fig. 14.
Fig. 14. Part of the tiling search space generated for a ReLU layer by the XNNC’s analytical cost model for the P1 digital signal processor. The higher the cost (yellower), the slower the kernel is expected to behave, when deployed onto the actual hardware. To emphasize the non-convex regions, we show only the first 16 sizes of possible tiling windows.

4.4.3 Renganarayana and Rajopadhye’s Cost Models.

Renganarayana and Rajopadhye [2008] show that models used to solve the “Tile Size Selection problem” are represented by equations whose coefficients and domains are all positive quantities. To support their observation, they list equations taken from eight different analytical models from previous work. These equations all represent instances of the bidimensional tile-size selection problem. They use variables that range on the following quantities:
C: the size of the cache in the target computer architecture;
L: the length of the cache line;
h: the height of the rectangular tiling window;
w: the width of the rectangular tiling window; and
n: the side of an \(\mathbf {n} \times \mathbf {n}\) array.
In this section, we fix C, L, and n, and plot the hypersurface formed by h and w, within a contiguous range of values. This approach simulates Apache TVM’s grid search algorithm.
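The sketch below shows the sweep in code. The function model_cost stands in for any of the equations in Table II (not reproduced here); toy_model is a made-up placeholder with the Positivity property, included only so the sketch runs.

```python
# A sketch of the grid sweep used to plot these hypersurfaces: fix C, L,
# and n, then evaluate a cost model over a contiguous range of tile
# heights and widths.
def sweep(model_cost, C, L, n, tile_range=range(1, 101)):
    return [[model_cost(C, L, n, h, w) for w in tile_range]
            for h in tile_range]

def toy_model(C, L, n, h, w):
    # A placeholder with the Positivity property (not one of the eight
    # models of Table II): cache pressure grows with the tile, while
    # reload traffic shrinks with it.
    return (h * w) / C + (n * n) / (h * w * L)

surface = sweep(toy_model, C=32 * 1024, L=64, n=1024)
print(min(min(row) for row in surface))
```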
Discussion. Figure 15 shows hypersurfaces for the different equations analyzed by Renganarayana and Rajopadhye. To ease visualization, we remove the h and w axes (both ranging over the interval \([1, \ldots , 100]\) ). The Droplet Expectation holds in all these analytical models. In fact, all of these equations use exclusively positive coefficients, hence yielding convex surfaces.
Fig. 15.
Fig. 15. Hypersurfaces formed by the models in Table II of Renganarayana and Rajopadhye [2008].

4.5 RQ5: Parallelism and Speculation

As explained in Section 3.4, the implementation of Droplet Search evaluated in this article uses synchronous parallelism and speculation to speed convergence up. This section evaluates the effects of these techniques on search time and on the quality of kernel configurations.
Methodology. We evaluate Droplet Search on the five different end-to-end models on the AMD 3700X CPU. This CPU runs 16 threads (two threads per core, with eight cores). This section experiments with three versions of Droplet Search: its full implementation, an implementation that features parallelism but no speculation, and an implementation that runs on a single thread. We report p-values comparing three executions of the fastest models found with each approach.
Fig. 16.
Fig. 16. (Left) Number of iterations until convergence of variations of Droplet Search. (Middle) Running time of different implementations of Droplet Search. (Right) The p-values produced by a t-test on populations of three executions of the best model found by each technique. The closer the p-value is to 1.0, the more statistically similar the two populations are.
Discussion. Figure 16 compares the different implementations of Droplet Search. Its single-threaded implementation takes 664 s to converge (sum of tuning time over five networks). The parallel version, with 16 threads, converges in 358 s. With speculation plus parallelism, this time goes down to 291 s. Parallelism is a known limitation of Coordinate Descent. Our implementation suffers from the shortcomings mentioned by Zheng et al. [2000]. The synchronous nature of coordinate descent limits concurrency: The best candidate is chosen after every point of a neighborhood is evaluated. Thus, progress only happens once the slowest point runs. Wang et al. [2016] have shown that it is possible to improve parallelism if more candidate points co-exist. We believe that early preemption of slow points could also speed our implementation up: Once the best candidate is found in a neighborhood, the other threads can be aborted. We leave such approaches—multiple candidates and early preemption—open for future work. Speculation improves the running time of our implementation of Droplet Search by a small margin. The parallel version of Droplet Search, with speculation, is 27% faster than the non-speculative implementation (geomean over speedups). This gain comes mostly from faster convergence.
Figure 16 (right) shows that the extended neighborhood explored via speculation has no effect on the speed of the kernel configurations found via Droplet Search. The search does not always find the same final configuration for every layer of every model; however, this phenomenon is due just to statistical variations in the running time of similar kernels. As the figure shows, the p-values reported by a t-test on the speed of the different kernels are well above 0.05. Thus, the extended neighborhood is not changing the behavior of our implementation of Droplet Search.

4.6 RQ6: Comparison with TensorFlow

The goal of this section is to bring some perspective about our results to readers that are not familiar with the Apache TVM ecosystem. To this end, we shall present a comparison between three different approaches to develop end-to-end models: AutoTVM, Ansor, and TensorFlow [Abadi et al. 2016]. The latter is a Python-based library to write machine learning models. In contrast to AutoTVM or Ansor, TensorFlow does not do, by itself, any form of scheduling: The programmer must feed it with an optimized implementation of a machine learning model.
Methodology. This section compares the different kernel implementation approaches using a benchmark collection formed by six kernels: matrix multiplication, two-dimensional (2D) convolution, depthwise separable convolution, pooling, matrix reduction, and 2D ReLU. These kernels have been taken from the artifact made publicly available by Zhu et al. [2022] and are meant to run on graphics processing units. We evaluate them on our RTX 3080 GPU and on an A6000.7 Incidentally, we have not been able to apply Zhu et al.’s tool to these very kernels.8
Discussion. Figure 17 shows the comparison between different kernel implementation systems. Notice that Figure 17 compares different kernel implementations. Ansor and AutoTVM (Droplet Search and XGB) receive, as input, the same code: a kernel implemented in Python with libraries from Apache TVM. TensorFlow receives a different implementation: kernels also written in Python, but with libraries from the TensorFlow package. As an example, the API to invoke the ReLU kernel is tf.nn.relu(a_tf) in TensorFlow and topi.nn.relu(a_tvm) in Apache TVM. In short, Figure 17 compares two different Python libraries.
Fig. 17.
Fig. 17. Comparison between Apache TVM and TensorFlow on the six kernels made available by Zhu et al. [2022], on two graphics processing units. This figure uses the same notation as Figures 11 and 18: Dark boxes indicate the fastest models, and gray boxes indicate the fastest search times. Light gray boxes indicate results that are statistically similar. TensorFlow does not implement search.
In every experiment, Apache TVM has produced kernels that are faster than (or as fast as) those of TensorFlow, be it through AutoTVM or Ansor. However, without scheduling, TensorFlow outperforms Apache TVM in two kernels: convolution and depthwise convolution. We have not observed a statistical difference between implementations of the reduction and the ReLU kernels, regardless of the library or the scheduling approach adopted to optimize them. In this regard, we observe that neither Ansor nor AutoTVM implements search for pooling, reduction, and ReLU. The implementation of these kernels, as provided by Zhu et al., does not come with a template of optimization parameters. The other kernels, in contrast, come with templates that enable thread blocking and tiling with shared memory. Loop unrolling is not enabled by these parameters. Nevertheless, Figure 17 confirms some of the results earlier observed by Zhu et al.: Kernels produced by Apache TVM tend to outperform similar kernels produced via TensorFlow. However, contrary to Zhu et al., we have observed smaller differences.

4.7 RQ7: Intra-Kernel Behavior

The models explored in Section 4.2 consist of multiple kernels: Each layer is implemented as an independent kernel. Droplet Search and all the other search techniques available in AutoTVM are intra-kernel; thus, kernels are optimized independently of each other. This section analyzes the effects of Droplet Search on individual kernels and compares these effects with the results obtained by other scheduling techniques. By showing that Droplet Search finds configurations similar to those found by exhaustive techniques, we provide further evidence that the droplet expectation commonly holds.
Methodology. We analyze each kernel of ResNet-18 and VGG-16 separately, reporting search time and kernel running time. Each kernel is extracted from its encompassing model via TVM's code generator. ResNet-18 and VGG-16 are our smallest networks in number of kernels. We restrict this study to two models because the individual analysis of each kernel is time-consuming (human time, not machine time). Nevertheless, we believe that these results could be extrapolated to the other models, which have similar implementations.
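Readers who wish to reproduce this kind of per-layer analysis can extract independent tuning tasks from a model, as in the sketch below; the use of TVM's built-in ResNet-18 testing workload, the XGB tuner, the trial budget, and the log-file names are assumptions of this example, not the exact setup of the article.

```python
from tvm import autotvm
from tvm.relay.testing import resnet

# A stand-in model: TVM's testing ResNet-18 (the article extracts kernels
# from its own models via TVM's code generator).
mod, params = resnet.get_workload(num_layers=18, batch_size=1)

# Each tunable layer becomes an independent AutoTVM task.
tasks = autotvm.task.extract_from_program(mod["main"], target="llvm", params=params)

measure = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=3),  # three samples per configuration
)

for i, task in enumerate(tasks):
    tuner = autotvm.tuner.XGBTuner(task)   # any AutoTVM tuner fits here
    tuner.tune(
        n_trial=1000,
        measure_option=measure,
        callbacks=[autotvm.callback.log_to_file(f"layer_{i}.log")],
    )
```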
Fig. 18. The search techniques of AutoTVM applied on individual layers of deep-learning models on the AMD R7-3700X. Evaluations average three samples. Light gray boxes mark running times that are statistically similar with confidence level of 95%. Black boxes denote statistically significant running times. Gray boxes mark statistically significant search times. The sum of search times approximates the results in Figure 11. The sum of kernel times, i.e., “Time (ms),” is strictly less than the running times reported in Figure 11.
Discussion. Figure 18 shows how the search techniques fare on each layer of ResNet-18 and VGG-16. We do not show results for Ansor because, as Figure 9 shows, Ansor’s implementation recognizes more layers in each model. Droplet Search never yields worse configurations than the other search techniques. Furthermore, its search is faster, converging with fewer samples. Column “iter” in Figure 18 reports the number of kernel configurations evaluated by each search technique. In contrast to Droplet Search, the other approaches used in AutoTVM have no notion of convergence: The search stops once a predetermined number of kernel configurations has been visited. However, sampling is not equally divided among the kernels: AutoTVM draws more samples for kernels that run for more time. As an example, the different search procedures of AutoTVM sample Layer Five of ResNet-18 1,024 times, as Figure 18 shows, whereas Droplet Search stops after 120 iterations. The number of samples that Droplet Search evaluates depends more on the dimensions of the layer, such as the number of channels and the width and height of the filter. The four largest layers in Figure 18 are, in this order, the \(6{\rm th}\) , the \(8{\rm th}\) , the \(9{\rm th}\) , and the \(11{\rm th}\) . Incidentally, these layers account for the largest numbers of samples evaluated by Droplet Search.

4.8 RQ8: Convergence Rate

The convergence rate of a search mechanism used to solve the kernel scheduling problem measures how fast that technique closes in on its final solution. As mentioned in Section 4.7, the different techniques that AutoTVM uses to solve kernel scheduling iterate until a fixed number of configurations has been evaluated. We have observed that it is often possible to stop iterating earlier, once a sufficiently good configuration is reached. That is the approach adopted in Droplet Search, as Section 3.3 explains. In what follows, we investigate how the quality of the final solution to scheduling improves as the number of evaluations grows.
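The sketch below illustrates this idea with a coordinate-descent-style local search that stops as soon as no neighbor improves the best running time; the ±1 neighborhood, the trial budget, and the function names are simplifying assumptions of this example, not the exact algorithm of Section 3.3.

```python
import itertools

def droplet_like_search(evaluate, start, bounds, max_trials=10_000):
    """Minimal local search with a convergence-based stop.

    `evaluate` maps a configuration (tuple of parameter indices) to a running
    time; `bounds` gives the number of choices per parameter.
    """
    best, best_time, trials = tuple(start), evaluate(tuple(start)), 1
    while trials < max_trials:
        improved = False
        # Visit the immediate neighborhood: one parameter changed by +-1.
        for axis, delta in itertools.product(range(len(best)), (-1, 1)):
            cand = list(best)
            cand[axis] += delta
            if not (0 <= cand[axis] < bounds[axis]):
                continue
            cand = tuple(cand)
            t = evaluate(cand)
            trials += 1
            if t < best_time:
                best, best_time, improved = cand, t, True
        if not improved:          # no neighbor improves: converged, stop early
            break
    return best, best_time, trials
```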
Methodology. We set the maximum number of iterations of AutoTVM’s grid, random, genetic, and XGB search to 10,000. This number, 10K, includes the evaluations of kernel configurations in the neighborhoods or speculative sets that Droplet Search uses. Ansor uses the same limit of evaluations. We then inspect the speed of the best kernel configuration that each of these search techniques finds for ResNet-18 throughout the search. We show results for ResNet-18 only; however, we have evaluated the convergence rate for the other models, and the results tend to be similar.
Fig. 19. Quality of the solution for kernel scheduling versus the number of evaluations (trials) used to find that solution for ResNet-18. The blue star marks the convergence of the Droplet Search.
Discussion. Figure 19 shows the results of this experiment. Droplet Search usually converges to its best solution well before 10,000 evaluations. The fastest convergence was observed on the Intel CPU: Coordinate Descent stabilized after 1,278 configurations had been evaluated. Notice that convergence happened before 2K evaluations on every CPU. The slowest convergence happened on the GTX GPU, with 7,125 evaluations; similarly, on the RTX, 6,502 configurations were evaluated. Similar results were also observed on the other four models: Droplet Search stabilizes before 10K iterations, typically before 3K iterations on the CPUs and before 8K iterations on the GPUs.

5 Related Work

This article aims to find the best concrete implementation of a given program. We recognize two main approaches to this theme, which we shall call autotuning (e.g., compiler autotuning) and scheduling (e.g., program autotuning, following the taxonomy of Tollenaere et al. [2023]). The latter is the problem from Definition 2.7. These problems differ in two essential ways:
Training: In autotuning, the compiler is trained offline on many programs—its training set—before being applied to an unknown program. Thus, the compiler uses information acquired from general programs before optimizing a specific program. In scheduling, there is no pre-training phase: The compiler does not try to generalize the behavior of a universe of programs to predict the behavior of an individual program.
Sampling: In autotuning, the compiler typically evaluates the target program once (although there are exceptions, like in the work of Cavazos et al. [2007]), using, as a guide, the behavior learned from observations made on the training set. In scheduling, the compiler is allowed to run the target program multiple times. Information acquired from these executions will guide the search for good optimizations.
Many of the current techniques employed in autotuning originate in the work of Cavazos and his collaborators [Agakov et al. 2006; Ashouri et al. 2018, 2016; Cavazos et al. 2006; Cavazos and O’Boyle 2006; Moss et al. 1997; Simon et al. 2013; Thomson et al. 2010]. The growing availability of predictive models and of benchmarks to train them has made works on autotuning common in the recent literature [Brauckmann et al. 2020; Cummins et al. 2021a, 2021b; Da Silva et al. 2021; Silva et al. 2021]. Figure 20 compares six autotuning-based techniques with six scheduling-based approaches. Whereas autotuning typically concerns the optimization of general programs, scheduling is mainly seen in the optimization of deep-learning models composed of kernels. Autotuning has been used, for instance, to find good sequences of optimizations for clang [da Silva et al. 2021; Silva et al. 2021] or to fine-tune the HotSpot JIT compiler [Cavazos and O’Boyle 2006].
Fig. 20. Qualitative comparison between different previous works. See Section 5 for the meaning of columns. We classify the first six works as autotuning and the last six as scheduling approaches.
The Transformation Space. Scheduling techniques usually fix the sequence of optimizations that forms the search space and vary their parameters [Chen et al. 2018; Tollenaere et al. 2023; Zhang et al. 2021]. However, there are scheduling approaches that accept different transformation vectors [Phothilimthana et al. 2021; Meng et al. 2022; Zheng et al. 2020; Essadki et al. 2023]. Within the TVM community, these different transformation vectors are called templates. For instance, a common technique, adopted in TVM’s Ansor, is to assume that each interchange of loops forms a different template. Figure 20 distinguishes fixed vectors as “params” and templated vectors as “param list.” Autotuning approaches usually fix the parameters of each optimization; however, they have more freedom to compose sequences of optimizations. For instance, Cavazos et al. [2007] form sequences of 500 optimizations drawn from a universe of 121 possible compilation flags. This approach is also adopted by Silva et al. [2021]: They produce lists of up to 100 elements drawn from approximately 80 compilation flags. Figure 20 uses the notation “list” to denote this way of building the search space. In Figure 20, we denote fixed-length vectors as “vec \([n]\) .”
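To make the notion of a fixed transformation vector concrete, the sketch below declares a matrix-multiplication template in the AutoTVM style, in which tiling and unrolling are fixed transformations and only their parameters vary; the template name, shapes, and knob values are assumptions of this example rather than the templates used in the article.

```python
import tvm
from tvm import te, autotvm

@autotvm.template("example/matmul_fixed_vector")  # hypothetical template name
def matmul(N, L, M, dtype="float32"):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    y, x = s[C].op.axis
    cfg = autotvm.get_config()
    cfg.define_split("tile_y", y, num_outputs=2)   # tiling window (rows)
    cfg.define_split("tile_x", x, num_outputs=2)   # tiling window (columns)
    cfg.define_knob("auto_unroll", [0, 8, 16])     # unrolling factor

    # The sequence of transformations is fixed; only their parameters vary.
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, yi, xi)
    s[C].pragma(yo, "auto_unroll_max_step", cfg["auto_unroll"].val)
    return s, [A, B, C]
```

Every point in the space defined by these knobs corresponds to one kernel schedule; the Cartesian product of the parameters is exactly the coordinate space that a tuner explores.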
The Search Guide. Techniques used in autotuning or scheduling differ in how the search for good transformation sequences is performed. In this article, we use Coordinate Descent. Our approach does not depend on static characteristics of the program, only on its dynamic behavior. Several search techniques use program features (static characteristics) to steer the search. These techniques are usually data-agnostic. An exception is the work of Da Silva et al. [2021], who use the runtime values of inputs to choose program configurations. Da Silva et al. capitalize on the convexity of the search space; however, in their case, tuning is guided by linear regression, not by Coordinate Descent. Recent scheduling techniques have used analytical models to prune the search space [Kaufman et al. 2021; Zhang et al. 2021; Tollenaere et al. 2023; Mogers et al. 2022]. Mogers et al. [2022] have shown, in the context of the Lift compiler [Steuwer et al. 2016], that pruning can be very effective, as “only 1 out of 49,000 candidates [generated by random search to optimize convolution] satisfies the constraints [hence is valid].” Pruning is orthogonal to the search techniques that we evaluate in this article. For instance, an interesting continuation of the ideas in this article would be to use Tollenaere et al.’s cost model to remove from the active neighborhood kernel configurations that are unlikely to improve on the best candidate seen by Coordinate Descent.
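Under the assumption that such a cost model is available as a function from configurations to predicted running times, that continuation could take the shape of the hypothetical filter below; the slack factor and all names are illustrative.

```python
from typing import Callable, Iterable, List, Tuple

Config = Tuple[int, ...]  # one point in the space of transformation parameters

def prune_neighborhood(neighbors: Iterable[Config],
                       predicted_time: Callable[[Config], float],
                       best_measured_time: float,
                       slack: float = 1.10) -> List[Config]:
    """Discard, before any measurement, the neighbors whose predicted running
    time is not within `slack` of the best time measured so far."""
    return [c for c in neighbors
            if predicted_time(c) <= slack * best_measured_time]
```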

6 Conclusion

This article has introduced a new kernel scheduling technique, Droplet Search, and has demonstrated its effectiveness by optimizing the code of six different deep learning models on six different hardware architectures. Droplet Search relies on the following observation: The optimization space formed by code transformation parameters usually determines a convex region that includes the origin of this space (no optimization) and the best kernel configuration that this space contains. Experiments show that Droplet Search tends to find kernels as efficient as those found by the other search techniques available in AutoTVM; however, it does so faster. Droplet Search also compares well with TVM’s Ansor, despite using a much smaller pool of optimizations.
Droplet Search is currently available in Apache TVM. This implementation still offers room for improvement. In particular, its convergence rate on large search spaces can be slow. Furthermore, the implementation could benefit more from parallelism, because it keeps only one best candidate at any time. We conjecture that Coordinate Descent could be modified to use multiple line searches, hence offering more opportunities for parallelization. Finally, the current implementation of Droplet Search explores the parameters of only four TVM optimizations: tiling, unrolling, thread blocking, and shared-memory tiling. Adding more optimizations to this list would be a welcome improvement. All these ideas are directions that we hope to see explored in the future.

Acknowledgment

The authors extend their appreciation to the ACM Transactions on Architecture and Code Optimization reviewers for their valuable suggestions, which significantly contributed to the enhancement of this work.

Footnotes

2. Notions of iteration and data space are standard in the compiler literature. Such concepts appeared independently in the works of Feautrier [1991] and Wolf and Lam [1991], eventually leading to the concept known today as the Polyhedral Model.
3. It is not clear who invented Coordinate Descent. Descriptions of the algorithm can be found in classic textbooks [Zangwill 1969]. For a comprehensive overview, we recommend the work of Wright [2015].
4. Previous work has observed that speedups due to compiler optimizations may not follow a normal distribution [Álvares et al. 2021]. Student’s t-test is parametric and, hence, not recommended for non-Gaussian distributions. However, non-parametric tests also have shortcomings; in particular, they tend to require more samples. As an example, the minimum recommended number of samples for Wilcoxon’s non-parametric test [Wilcoxon 1992] would be five, under a confidence level of 95%.
6.
7. The A6000 GPU is only used in this section. Access to this hardware was kindly provided by the Discovery Lab (https://discovery.ic.unicamp.br/).
8. Roller, the tool implemented by Zhu et al., relies on RT Cores to speed up the execution of kernels. This tool uses functions that are exclusive to CUDA v10.2 and TensorFlow v1.15.2. However, the RTX 3080, the only GPU that we have with RT Cores, is only compatible with CUDA v12.0 and TensorFlow v2.13.0. We could not downgrade the version of CUDA. Furthermore, a direct change of APIs to upgrade the versions of CUDA and TensorFlow in Roller was not enough to let us reuse the tool: The updated version compiles successfully; however, it does not produce kernels. We faced similar issues when trying to reuse another artifact that targets RT Cores: Heron [Bi et al. 2023]. Heron was implemented with the CUDA v11.0 API. We have also updated it to use CUDA v12.0; the updated version compiles and produces kernels, but these kernels crash during execution.

References

[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In OSDI. USENIX Association, New York, NY, 265–283.
[2]
F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O’Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. 2006. Using machine learning to focus iterative optimization. In CGO. IEEE Computer Society, Washington, DC, 295–305. DOI:
[3]
Andrei Rimsa Álvares, José Nelson Amaral, and Fernando Magno Quintão Pereira. 2021. Instruction visibility in SPEC CPU2017. J. Comput. Lang. 66 (2021), 101062. DOI:
[4]
Amir H. Ashouri, William Killian, John Cavazos, Gianluca Palermo, and Cristina Silvano. 2018. A survey on compiler autotuning using machine learning. Comput. Surv. 51, 5 (2018), 96:1–96:42. DOI:
[5]
Amir Hossein Ashouri, Giovanni Mariani, Gianluca Palermo, Eunjung Park, John Cavazos, and Cristina Silvano. 2016. COBAYN: Compiler autotuning framework using bayesian networks. ACM Trans. Arch. Code Optim. 13, 2 (2016), 21:1–21:25. DOI:
[6]
D. Bertsekas. 2009. Convex Optimization Theory. Athena Scientific, Nashua, NH. https://books.google.com.br/books?id=0H1iQwAACAAJ
[7]
Jun Bi, Qi Guo, Xiaqing Li, Yongwei Zhao, Yuanbo Wen, Yuxuan Guo, Enshuai Zhou, Xing Hu, Zidong Du, Ling Li, Huaping Chen, and Tianshi Chen. 2023. Heron: Automatically constrained high-performance library generation for deep learning accelerators. In ASPLOS. Association for Computing Machinery, New York, NY, 314–328. DOI:
[8]
Alexander Brauckmann, Andrés Goens, Sebastian Ertel, and Jeronimo Castrillon. 2020. Compiler-based graph representations for deep learning models of code. In CC. Association for Computing Machinery, New York, NY, 201–211. DOI:
[9]
John Cavazos, Grigori Fursin, Felix Agakov, Edwin Bonilla, Michael F. P. O’Boyle, and Olivier Temam. 2007. Rapidly selecting good compiler optimizations using performance counters. In CGO. IEEE Computer Society, Los Alamitos, CA, 185–197. DOI:
[10]
John Cavazos, J. Eliot B. Moss, and Michael F. P. O’Boyle. 2006. Hybrid optimizations: Which optimization algorithm to use? In Proceedings of the 15th International Conference on Compiler Construction (CC’06). Springer-Verlag, Berlin, 124–138. DOI:
[11]
John Cavazos and Michael F. P. O’Boyle. 2006. Method-specific dynamic compilation using logistic regression. In OOPSLA. Association for Computing Machinery, New York, NY, 229–240. DOI:
[12]
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274. Retrieved from http://arxiv.org/abs/1512.01274
[13]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI (OSDI’18). USENIX Association, Berkeley, CA, 579–594.
[14]
Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Michael F. P. O’Boyle, and Hugh Leather. 2021a. ProGraML: A graph-based program representation for data flow analysis and compiler optimizations. In ICML, Vol. 139. PMLR, Baltimore, MD, 2244–2253.
[15]
Chris Cummins, Bram Wasti, Jiadong Guo, Brandon Cui, Jason Ansel, Sahir Gomez, Somya Jain, Jia Liu, Olivier Teytaud, Benoit Steiner, Yuandong Tian, and Hugh Leather. 2021b. CompilerGym: Robust, performant compiler optimization environments for AI research. arXiv:2109.08267. Retrieved from https://arxiv.org/abs/2109.08267.
[16]
Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira. 2021. AnghaBench: A suite with one million compilable C benchmarks for code-size reduction. In CGO. IEEE, Los Alamitos, CA, 378–390. DOI:
[17]
Junio Cezar Ribeiro Da Silva, Lorena Leão, Vinicius Petrucci, Abdoulaye Gamatié, and Fernando Magno Quintão Pereira. 2021. Mapping computations in heterogeneous multicore systems with statistical regression on program inputs. ACM Trans. Embed. Comput. Syst. 20, 6, Article 112 (Oct.2021), 35 pages. DOI:
[18]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, New York, NY, 4171–4186. DOI:
[19]
Mohamed Essadki, Bertrand Michel, Bruno Maugars, Oleksandr Zinenko, Nicolas Vasilache, and Albert Cohen. 2023. Code generation for in-place stencils. In CGO. Association for Computing Machinery, New York, NY, 2–13. DOI:
[20]
Paul Feautrier. 1991. Dataflow analysis of array and scalar references. Int. J. Parallel Program. 20, 1 (1991), 23–53. DOI:
[21]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv:1512.03385. Retrieved from http://arxiv.org/abs/1512.03385
[22]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. Retrieved from http://arxiv.org/abs/1704.04861
[23]
Charles Jin, Phitchaya Mangpo Phothilimthana, and Sudip Roy. 2022. Neural architecture search using property guided synthesis. Proc. ACM Program. Lang. 6, OOPSLA2, Article 166 (Oct.2022), 30 pages. DOI:
[24]
Samuel J. Kaufman, Phitchaya Mangpo Phothilimthana, Yanqi Zhou, Charith Mendis, Sudip Roy, Amit Sabne, and Mike Burrows. 2021. A learned performance model for tensor processing units. In MLSys, Alex Smola, Alex Dimakis, and Ion Stoica (Eds.). mlsys.org, Indio, CA, 15.
[25]
S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. 1988. Optimization by Simulated Annealing. MIT Press, Cambridge, MA, 551–567.
[26]
Mikhail Lebedev and Pavel Belecky. 2021. A survey of open-source tools for FPGA-based inference of artificial neural networks. In IVMEM. IEEE, New York, NY, 50–56. DOI:
[27]
David A. Levine. 1969. Algorithm 344: Student’s t-distribution [S14]. Commun. ACM 12, 1 (Jan.1969), 37–38. DOI:
[28]
Jintao Meng, Chen Zhuang, Peng Chen, Mohamed Wahib, Bertil Schmidt, Xiao Wang, Haidong Lan, Dou Wu, Minwen Deng, Yanjie Wei, and Shengzhong Feng. 2022. Automatic generation of high-performance convolution kernels on ARM CPUs for deep learning. IEEE Trans. Parallel Distrib. Syst. 33, 11 (2022), 2885–2899. DOI:
[29]
Naums Mogers, Lu Li, Valentin Radu, and Christophe Dubach. 2022. Mapping parallelism in a functional IR through constraint satisfaction: A case study on convolution for mobile GPUs. In CC. Association for Computing Machinery, New York, NY, 218–230. DOI:
[30]
Eliot Moss, Paul Utgoff, John Cavazos, Doina Precup, Darko Stefanovic, Carla Brodley, and David Scheeff. 1997. Learning to schedule straight-line code. In NIPS. MIT Press, Cambridge, MA, 929–935. DOI:
[31]
Auguste Olivry, Guillaume Iooss, Nicolas Tollenaere, Atanas Rountev, P. Sadayappan, and Fabrice Rastello. 2021. IOOpt: Automatic derivation of I/O complexity bounds for affine programs. In PLDI. Association for Computing Machinery, New York, NY, 1187–1202. DOI:
[32]
Auguste Olivry, Julien Langou, Louis-Noël Pouchet, P. Sadayappan, and Fabrice Rastello. 2020. Automated derivation of parametric data movement lower bounds for affine programs. In PLDI. Association for Computing Machinery, New York, NY, 808–822. DOI:
[33]
Phitchaya Mangpo Phothilimthana, Amit Sabne, Nikhil Sarda, Karthik Srinivasa Murthy, Yanqi Zhou, Christof Angermueller, Mike Burrows, Sudip Roy, Ketan Mandke, Rezsa Farahani, Yu Emma Wang, Berkin Ilbeyi, Blake A. Hechtman, Bjarke Roune, Shen Wang, Yuanzhong Xu, and Samuel J. Kaufman. 2021. A flexible approach to autotuning multi-pass machine learning compilers. In PACT, Jaejin Lee and Albert Cohen (Eds.). IEEE, Los Alamitos, CA, 1–16. DOI:
[34]
Lakshminarayanan Renganarayana and Sanjay Rajopadhye. 2008. Positivity, posynomials and tile size selection. In SC. IEEE Press, Los Alamitos, CA.
[35]
Peter Richtárik and Martin Takác. 2012. Parallel Coordinate Descent methods for big data optimization. arXiv:1212.0873. Retrieved from http://arxiv.org/abs/1212.0873
[36]
Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. 2018. Going deeper in spiking neural networks: VGG and residual architectures. arXiv:1802.02627. Retrieved from http://arxiv.org/abs/1802.02627
[37]
Anderson Faustino da Silva, Bernardo N. B. de Lima, and Fernando Magno Quintão Pereira. 2021. Exploring the space of optimization sequences for code-size reduction: Insights and tools. In Compiler Construction. Association for Computing Machinery, New York, NY, 47–58. DOI:
[38]
Douglas Simon, John Cavazos, Christian Wimmer, and Sameer Kulkarni. 2013. Automatic construction of inlining heuristics using machine learning. In CGO. IEEE Computer Society, Los Alamitos, CA, 1–12. DOI:
[39]
Michel Steuwer, Toomas Remmelg, and Christophe Dubach. 2016. Matrix multiplication beyond auto-tuning: Rewrite-based GPU code generation. In CASES. Association for Computing Machinery, New York, NY, Article 15, 10 pages. DOI:
[40]
Rafael Sumitani, Lucas Silva, Frederico Campos, and Fernando Pereira. 2023. A class of programs that admit exact complexity analysis via Newton’s polynomial interpolation. In SBLP. Association for Computing Machinery, New York, NY, 50–55. DOI:
[41]
John Thomson, Michael O’Boyle, Grigori Fursin, and Björn Franke. 2010. Reducing training time in a one-shot machine learning-based compiler. In Languages and Compilers for Parallel Computing, Guang R. Gao, Lori L. Pollock, John Cavazos, and Xiaoming Li (Eds.). Springer, Berlin, 399–407.
[42]
Nicolas Tollenaere, Guillaume Iooss, Stéphane Pouget, Hugo Brunie, Christophe Guillon, Albert Cohen, P. Sadayappan, and Fabrice Rastello. 2023. Autotuning convolutions is easier than you think. ACM Trans. Archit. Code Optim. 20, 2, Article 20 (Mar. 2023), 24 pages. DOI:
[43]
Xiao Wang, Amit Sabne, Sherman Kisner, Anand Raghunathan, Charles Bouman, and Samuel Midkiff. 2016. High performance model based image reconstruction. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’16). Association for Computing Machinery, New York, NY, Article 2, 12 pages. DOI:
[44]
Frank Wilcoxon. 1992. Individual Comparisons by Ranking Methods. Springer New York, New York, NY, 196–202. DOI:
[45]
Michael E. Wolf and Monica S. Lam. 1991. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation (PLDI ’91). Association for Computing Machinery, New York, NY, 30–44. DOI:
[46]
Stephen J. Wright. 2015. Coordinate Descent algorithms. Math. Program. 151, 1 (Jun.2015), 3–34. DOI:
[47]
W. Zangwill. 1969. Nonlinear Programming, A Unified Approach (1st ed.). Prentice Hall, Hoboken NJ.
[48]
Xiaoyang Zhang, Junmin Xiao, and Guangming Tan. 2021. I/O lower bounds for auto-tuning of convolutions in CNNs. In PPoPP. Association for Computing Machinery, New York, NY, 247–261. DOI:
[49]
Jun Zheng, Suhail S. Saquib, Ken D. Sauer, and Charles A. Bouman. 2000. Parallelizable Bayesian tomography algorithms with rapid, guaranteed convergence. IEEE Trans. Image Process. 9, 10 (2000), 1745–1759. DOI:
[50]
Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating high-performance tensor programs for deep learning. In OSDI. USENIX Association, Berkeley, CA, Article 49, 17 pages.
[51]
Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. 2022. ROLLER: Fast and efficient tensor compilation for deep learning. In OSDI, Marcos K. Aguilera and Hakim Weatherspoon (Eds.). USENIX Association, Berkeley, CA, 233–248.
