# A Full-stack Accelerator Search Technique for Vision Applications

Dan Zhang<sup>1</sup>, Safeen Huda<sup>2</sup>, Ebrahim Songhori<sup>1</sup>, Quoc Le<sup>1</sup>, Anna Goldie<sup>1</sup>, and Azalia Mirhoseini<sup>1</sup>

<sup>1</sup>Google Brain

<sup>2</sup>Google

{dazh, safeen, esonghori, qvl, agoldie, azalia}@google.com

# Abstract

The rapidly changing ML model landscape presents a unique opportunity for building hardware accelerators optimized for specific datacenter-scale workloads. We propose Full-stack Accelerator Search Technique (FAST), a hardware accelerator search framework that defines a broad optimization environment covering key design decisions within the hardware-software stack, including hardware datapath, software scheduling, and compiler passes such as operation fusion and tensor padding. Although FAST can be used on any number and type of deep learning workloads, in this paper we focus on optimizing for a single vision model or a small set of vision models, resulting in significantly faster and more power-efficient designs relative to a general-purpose ML accelerator. When evaluated on EfficientNet [45], ResNet50v2 [18] and OCR [36] inference performance relative to a TPU-v3 [22], designs generated by FAST optimized for single workloads can improve Perf/TDP (peak power) by over 6x in the best case and 4x on average. On a limited workload subset, FAST improves Perf/TDP 2.85x on average, with a reduction to 2.35x for a single design optimized over the set of workloads. In addition, we demonstrate a potential 1.8x speedup opportunity for TPU-v3 with improved scheduling.

# 1. Introduction

Accelerator design and deployment is a multi-year process [22, 23]. As a result, new deep learning accelerators are typically optimized for neural networks that were developed years ago and may not perform as efficiently on the newest state-of-the-art models. One such model is EfficientNet [45], which delivers state-of-the-art accuracy while also minimizing model parameter count and Floating Point Operations (FLOPS) through its extensive use of depthwise-separable convolutions [19]. However, the number of FLOPS is not an accurate proxy for performance on state-of-the-art accelerators such as Google TPUs or NVIDIA GPUs [5], on which EfficientNets tend to run inefficiently due to their reduced operational intensity and parallelism [30].



Figure 1: FAST-Large (described in Table 4) inference latency vs. ImageNet top-1 accuracy. Faster hardware accelerators can run larger, more accurate models with the same latency budget, or significantly reduce inference latency and increase throughput given a fixed accuracy budget. These techniques do not affect model accuracy; quantization can bring further gains but is outside the scope of this paper.

EfficientNet-X [30] introduces a new network architecture search scheme in which the depthwise separable convolution can be replaced with a standard convolution, yielding a 43% inference speedup on TPU-v3 running EfficientNet-B7. However, the poor performance of depthwise convolutions on current accelerators remains a challenge.

Motivated by this observation, we propose a new *Full-stack Accelerator Search Technique* (FAST) that enables the design and evaluation of custom accelerators optimized for one model or family of neural network models. As shown in Figure 2, FAST is an automated architecture search framework for optimizing the accelerator stack from the software compiler all the way to the underlying hardware architecture. When provided with a set of deep learning models and a set of performance requirements (e.g. area, latency, power thresholds), FAST performs design space exploration and outputs a Pareto-optimal curve corresponding to designs with varying performance trade-offs.

Hardware accelerator architectures can be described in terms of their datapath and schedule, where the datapath comprises the hardware components (e.g. compute units, scratchpad memories, and connectivity) on which neural network operations are run, and the schedule comprises the compiler scheduling and hardware control logic that maps these operations onto the datapath. Common datapath designs use grids of processing elements (PEs), including scalar [6, 33, 38, 43, 49], vector [41, 46, 50, 51, 56], or matrix [23] compute units. FAST's datapath template is capable of expressing scalar, vector, and matrix processing elements. FAST is also equipped with a highly versatile memory hierarchy search space.

A key component of FAST is our flexible simulator for evaluating a hardware accelerator's performance for a given neural network. We built a fast and accurate simulation platform capable of modeling a wide range of hardware datapaths and schedules by extending a heavily modified version of Timeloop [35] and addressing its key limitations as discussed in Section 5.1. Another key component is *FAST fusion*, an *Integer Linear Programming* (ILP) based multilayer fusion technique which significantly improves memory bandwidth usage efficiency and thus inference execution time. Finally, FAST also performs tensor padding as a pre-processing step to improve efficiency.

To our knowledge, our framework is the first to enable such a broad software and hardware co-optimization search space. FAST is capable of jointly optimizing hardware datapath, software schedule, and compiler passes such as operation fusion and tensor padding, with a combined search space of up to $\mathcal{O}(10^{2300})$<sup>1</sup>. Figure 1 shows the potential of FAST on EfficientNets with software-only updates (i.e. schedule, fusion, and padding) applied to a fixed hardware configuration, with even larger speedups possible when running full SW/HW co-optimization for each model.

We evaluate FAST on computer vision-centric inference models in Section 5 and show a breakdown of the gains that can be attributed to each component of our optimization framework. We report results in terms of important metrics, including performance (in inferences per second) and performance per watt as defined by Thermal Design Power (TDP), which is known to be correlated with Total Cost of Ownership (TCO) [23]. When evaluated on EfficientNet [45], ResNet50v2 [18] and OCR [36] inference performance relative to a TPU-v3 [22], designs generated by FAST optimized for single workloads can improve Perf/TDP (peak power) by over 6x in the best case and 4x on average. On a limited subset of workloads, FAST improves Perf/TDP 2.85x on average, with a reduction to 2.35x for a single design optimized over the set of workloads. In addition, we demonstrate a potential 1.8x speedup opportunity for TPU-v3 with improved scheduling.



Figure 2: Full Stack Accelerator architecture Search Technique (FAST) overview.

# 2. Related Work

**Accelerator design space:** Hardware ML accelerators can be described in terms of their hardware datapath and software schedule. Datapath designs often use grids of processing elements composed of scalar [6, 33, 38, 43, 49], vector [41, 46, 50, 51, 56], or matrix [23] compute units. Prior work has performed design space exploration by mutating datapath hyperparameters, such as the number of PEs and buffer capacities [24, 33, 38, 43, 46, 49, 50, 56], as well as the mapping of convolutions onto the datapath [24, 38, 43, 46, 49]. As described in Section 4.4, our datapath template is designed to be an approximate superset of popular designs capable of expressing scalar, vector, and matrix processing elements with varying memory hierarchies. Our search space also includes scheduling and other compiler optimizations, such as op fusion and tensor padding, enabling us to cover a much broader set of architectures. We also optimize for state-of-the-art vision models including EfficientNet and demonstrate that our large co-optimization space allows for significant improvements over existing datacenter accelerators.

A flexible scheduler is key to evaluating accelerator performance for a given neural network. Timeloop [35] and MAESTRO [28] use random search to optimize accelerator schedules given a datapath and layer definition. However, they only evaluate single layers and only consider convolution operation performance, thereby limiting utility for end-to-end performance evaluation and optimization (e.g. operator fusion, parameter prefetching). Compared to Timeloop, the MAESTRO datapath design space is more restrictive, assuming a fixed scalar PE architecture with private L1 and global L2 scratchpads. Interstellar [49] uses Halide [37] to generate and analyze inference accelerators. Although Interstellar can perform many blocking and spatial optimizations, its datapath search space is limited to basic systolic arrays and reduction trees with global buffers. To create an accelerator simulation platform capable of modeling a diverse set of datapaths and schedules, we integrated a modified version of Timeloop into a production ML accelerator simulator as described in Section 5.1.

Within FAST, we have also developed FAST fusion, an efficient ILP-based multi-layer fusion technique which significantly improves memory bandwidth usage efficiency and thus inference execution time. While there has been a growing body of work on fusion [3, 4, 25, 34, 39, 57], to our knowledge, our work is the first to co-optimize datapath, schedule and fusion. Due to memory bandwidth bottlenecks caused by poor operational intensity as discussed in Section 3.3.1, we demonstrate in Section 5.2.6 that fusion is the key to unlocking significant performance gains.

<sup>1</sup>This estimate takes the product of the *mapspace* [35] sizes for each layer in a moderately-sized model like ResNet-50 ($\sim 10^{2000}$), combined with the $10^{14}$ datapath and $10^{300}$ op fusion search spaces, rounded down.

**Accelerator search on Field-Programmable Gate Arrays (FPGAs):** Several recent efforts have targeted the acceleration of neural networks on FPGAs, which unlike ASICs enable flexible hardware reconfiguration. These prior works primarily focused on automation tools and design space exploration for one particular neural network [3, 17, 42, 53, 55]. However, the main challenge with FPGAs is that their flexibility comes at the cost of reduced performance and higher energy consumption [27]. Unlike prior work, our framework enables the exploration of a broad range of datapaths, schedules, and fusion strategies.

**Co-optimization of neural networks and hardware:** More recently, co-optimizing neural networks and accelerators has gained significant attention [1, 17, 21, 31, 48, 52, 54, 56]. The co-optimization design space contains variations of both the neural network architecture and hardware components, and the objective is to jointly optimize for both accuracy and performance. While our framework does not currently allow modifications to the neural network architecture, it would be straightforward to extend it to enable co-design search. However, even with only hardware accelerator optimization, FAST already delivers significantly higher performance than previous work [56] through the larger search space covering datapath, schedule, and fusion. We would expect to see further gains if we also allowed co-optimization with neural network architecture design.

# 3. Background and Motivation

### 3.1. Mapping Convolutions onto Accelerators

Since convolutions dominate the overall runtime in convolutional neural networks (CNNs), considerable effort has been expended on software [29] and hardware [6, 10] acceleration of these operations. A standard Conv2D can be represented as a 7-dimensional nested loop over batch size (*B*), output tensor height and width (*OH*, *OW*), number of input and output features (*IF*, *OF*), and kernel height and width (*KH*, *KW*). Since these loop iterations are commutative, compilers can freely modify loop traversal order, allowing for arbitrary transformations in tensor data layout format, loop blocking, and spatial vectorization [28, 35]. A number of recent works have exploited these properties to build efficient high-performance accelerators [24, 43, 49].
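The loop-nest view of Conv2D can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: stride 1, no padding, plain Python lists, and hypothetical names (`conv2d`, `DIMS`, and the loop indices `oh`/`ow` for the output height and width). It is not taken from any compiler discussed here. It iterates the seven loops in a caller-chosen order to show that, because the accumulations commute, any permutation of the loops produces the same output.

```python
from itertools import product

# Hypothetical names for the seven loop dimensions of a Conv2D.
DIMS = ("b", "oh", "ow", "of", "if", "kh", "kw")

def conv2d(x, w, order=DIMS):
    """Direct Conv2D (stride 1, no padding) with a configurable loop order.

    x is indexed [b][h][w][if]; w is indexed [kh][kw][if][of].
    """
    B, IH, IW = len(x), len(x[0]), len(x[0][0])
    KH, KW, IF, OF = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    OH, OW = IH - KH + 1, IW - KW + 1
    extent = {"b": B, "oh": OH, "ow": OW, "of": OF, "if": IF, "kh": KH, "kw": KW}
    y = [[[[0] * OF for _ in range(OW)] for _ in range(OH)] for _ in range(B)]
    # One flat iteration over the 7-dimensional loop nest, in the given order.
    for idx in product(*(range(extent[d]) for d in order)):
        i = dict(zip(order, idx))
        y[i["b"]][i["oh"]][i["ow"]][i["of"]] += (
            x[i["b"]][i["oh"] + i["kh"]][i["ow"] + i["kw"]][i["if"]]
            * w[i["kh"]][i["kw"]][i["if"]][i["of"]]
        )
    return y

# 1 image, 3x3 pixels holding 1..9, one input feature.
x = [[[[r * 3 + c + 1] for c in range(3)] for r in range(3)]]
# 2x2 kernel of ones, one input feature, one output feature.
w = [[[[1]], [[1]]], [[[1]], [[1]]]]

default_order = conv2d(x, w)
permuted = conv2d(x, w, order=tuple(reversed(DIMS)))
```

Integer inputs are used so that the two traversal orders can be compared exactly; with floating-point data the results would agree only up to rounding.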

Systolic arrays combine parallel operations with local communication, making them well-suited to matrix computations [26]. To multiply two matrices, one matrix is latched into internal registers<sup>2</sup>, while the other is streamed through the array. Accelerators such as Google's TPU family [22] exploit the dense compute enabled by systolic arrays to accelerate training and inference. Under a *weight stationary* mapping [9], the systolic array will not be fully utilized unless *IF*, *OF*, and *B* are multiples of the dimensions of the systolic array. Alternative mappings, such as *output stationary* and *row stationary* [6], may achieve higher utilization by selecting alternative dimensions to be spatially unrolled, but are still limited by dimensional constraints. Therefore, although larger systolic arrays are more area- and power-efficient, they tend to have lower utilization.
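The dependence of utilization on array size can be illustrated with a simplified model. The sketch below assumes a weight-stationary mapping in which input features occupy array rows and output features occupy array columns, and it ignores the batch dimension and any pipeline fill or drain effects. The function name and this mapping are assumptions made for illustration, not a description of any specific accelerator.

```python
from math import ceil

def weight_stationary_utilization(in_features, out_features, rows, cols):
    """Fraction of array cells holding useful weights, averaged over all tiles.

    Assumes IF is unrolled over rows and OF over columns; partially filled
    edge tiles leave the remaining cells idle.
    """
    row_tiles = ceil(in_features / rows)
    col_tiles = ceil(out_features / cols)
    occupied = in_features * out_features
    available = row_tiles * rows * col_tiles * cols
    return occupied / available

# A 32x32 weight matrix fills a 32x32 array but only 1/16 of a 128x128 array.
small_array = weight_stationary_utilization(32, 32, 32, 32)
large_array = weight_stationary_utilization(32, 32, 128, 128)
```

Under this model a layer with few channels wastes most of a large array, which is the trade-off between efficiency and utilization described above.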

### 3.2. Depthwise-Separable Convolution

CNNs are often over-parameterized [16, 20]. A popular method for reducing model size and compute cost is replacing Conv2D with a depthwise-separable convolution: a depthwise convolution combined with a 1x1 point-wise convolution [7, 40, 44]. For example, a 3x3 depthwise-separable convolution uses 8-9x less compute than a standard Conv2D with only a slight reduction in accuracy [19]. By achieving state-of-the-art Top-1 accuracy on ImageNet while significantly reducing model parameters and FLOPS, EfficientNet [45] demonstrated that depthwise-separable convolutions were viable outside of compute and storage-constrained settings. However, depthwise-separable convolutions do not map well onto TPUs due to poor systolic array utilization and operational intensity. By reducing kernel filter depth (IF) to 1, depthwise convolutions allow significant parameter and compute reduction, but common mappings unfortunately depend on large IF for good utilization. For example, assuming a depthwise convolution with a 3x3 kernel, maximum utilization for a 128x128 systolic array is only KH \* KW = 9 out of 128. Utilization can be improved with smaller systolic arrays or alternative mappings.
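The 8-9x figure and the 9-out-of-128 utilization bound can both be checked with a few lines of arithmetic. The sketch below counts multiply-accumulates per output pixel; the function names are hypothetical, and the utilization model (only the KH \* KW kernel taps occupy the array's reduction dimension) is a simplification assumed for illustration.

```python
def conv_macs(kh, kw, in_ch, out_ch):
    """MACs per output pixel for a standard Conv2D."""
    return kh * kw * in_ch * out_ch

def dsconv_macs(kh, kw, in_ch, out_ch):
    """MACs per output pixel for a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kh * kw * in_ch
    pointwise = in_ch * out_ch
    return depthwise + pointwise

def compute_savings(kh, kw, in_ch, out_ch):
    """How many times fewer MACs the depthwise-separable form needs."""
    return conv_macs(kh, kw, in_ch, out_ch) / dsconv_macs(kh, kw, in_ch, out_ch)

def depthwise_max_utilization(kh, kw, array_dim):
    """Upper bound on utilization when only KH*KW values feed the reduction dimension."""
    return min(kh * kw, array_dim) / array_dim

savings_128 = compute_savings(3, 3, 128, 128)    # equals 9*128 / (9 + 128)
utilization = depthwise_max_utilization(3, 3, 128)
```

The savings ratio simplifies to KH \* KW \* OF / (KH \* KW + OF), which approaches KH \* KW = 9 as the number of output features grows.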

### 3.3. Case Study: EfficientNet Performance Analysis

In the following section, we analyze various contributing factors to EfficientNet performance on TPU-v3. We first characterize EfficientNet in terms of operational intensity and discuss the impact of op fusion. We then analyze the implications of TPU-v3 architecture and compute scheduling strategy on EfficientNet. These characterizations motivated us to build a comprehensive hardware and software search space for FAST that is able to deliver significant performance improvements for EfficientNets and the broader set of popular vision models such as ResNet and OCR.

#### 3.3.1 Operational Intensity and Op Fusion

Figure 3: The impact of op fusion on operational intensity. Models with op intensity below 200 are memory bandwidth-bottlenecked on current accelerators. ResNet-50 does not contain depthwise-separable convolution (DSConv) or mobile inverted residual (MBConv) blocks. Increasing batch size is effective for ResNet-50, but insufficient for EfficientNet. Supporting future accelerators with op intensity over 400 is possible, but will require more advanced fusion techniques.

| Model           | Max Working Set | Weights  |
|-----------------|-----------------|----------|
| EfficientNet-B0 | 2.87 MiB        | 12.7 MiB |
| EfficientNet-B1 | 3.3 MiB         | 22.1 MiB |
| EfficientNet-B2 | 3.9 MiB         | 26.1 MiB |
| EfficientNet-B3 | 5.1 MiB         | 36.8 MiB |
| EfficientNet-B4 | 12.4 MiB        | 61.4 MiB |
| EfficientNet-B5 | 17.8 MiB        | 101 MiB  |
| EfficientNet-B6 | 31.9 MiB        | 146 MiB  |
| EfficientNet-B7 | 41.2 MiB        | 231 MiB  |

Table 1: EfficientNet on-chip storage requirements (bfloat16). Working set sizes are shown for the op with the largest memory footprint at batch size 1. The storage requirements of larger EfficientNets exceed on-chip memory capacity, requiring more advanced op fusion techniques.

<sup>2</sup>Double-buffering is typically employed to mask the latency of latching a new set of parameters into the systolic array [23].

ML model graphs are executed on accelerators as a series of kernels, or *operations*, where each op reads its inputs from device memory (DRAM), transfers these inputs to on-chip memory, performs the computation, and writes the output back to DRAM. This results in unnecessary DRAM reads and writes for intermediate values which are usually performed in parallel with computation, but may cause slow-downs with insufficient bandwidth. To determine if a model is compute or memory bandwidth-bound, one can calculate a model's *operational intensity*, defined as the ratio of compute operations (in FLOPS) to DRAM accesses (in bytes). For example, a TPU-v3 chip supports 123 TFLOPS/s of bfloat16 compute and 900 GB/s memory bandwidth [22]. Therefore, a model that can otherwise operate at full compute utilization must have an operational intensity of at least 137 FLOPS/B to avoid becoming memory-bound. Note that it is cheaper to scale compute performance than memory bandwidth due to the *memory wall* [47]. The latest NVIDIA A100 GPU supports 312 TFLOPS bfloat16 with 1.5 TB/s bandwidth [8], requiring an operational intensity of 208 FLOPS/B to prevent bandwidth bottlenecks.
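The break-even thresholds quoted for TPU-v3 and A100 follow directly from dividing peak compute by memory bandwidth. The short sketch below reproduces that arithmetic and adds a roofline-style estimate of attainable throughput; the function names are hypothetical, and the model assumes perfect overlap of compute and memory traffic.

```python
def min_operational_intensity(peak_flops_per_s, dram_bytes_per_s):
    """Operational intensity (FLOPS/B) needed to keep the compute units busy."""
    return peak_flops_per_s / dram_bytes_per_s

def attainable_flops_per_s(intensity, peak_flops_per_s, dram_bytes_per_s):
    """Roofline-style bound: limited by either peak compute or DRAM bandwidth."""
    return min(peak_flops_per_s, intensity * dram_bytes_per_s)

def is_memory_bound(intensity, peak_flops_per_s, dram_bytes_per_s):
    return intensity < min_operational_intensity(peak_flops_per_s, dram_bytes_per_s)

# Peak bfloat16 compute and memory bandwidth as quoted in the text.
TPU_V3 = (123e12, 900e9)
A100 = (312e12, 1.5e12)

tpu_v3_threshold = min_operational_intensity(*TPU_V3)
a100_threshold = min_operational_intensity(*A100)
```

For instance, `is_memory_bound(35, *TPU_V3)` evaluates to true, since 35 FLOPS/B is well below the TPU-v3 break-even point.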

Compilers such as TensorFlow XLA [15] mitigate this issue with *operation fusion*, merging multiple ops into one large op to avoid DRAM accesses of intermediate results, resulting in greater operational intensity and improved performance [22]. Most prior work has focused on training, where intermediate results must be preserved for the backwards pass [2, 32]. In this work, we focus on inference, which does not require a backwards pass, meaning that intermediate results may be immediately discarded after use.

Figure 3 shows that EfficientNet has low operational intensity due to its heavy use of depthwise-separable convolutions. Without op fusion, EfficientNet operational intensity ranges from 13 to 35 FLOPS/B, far below the level required to run without memory bottlenecks on TPU-v3 or A100. Using batching to amortize weight accesses across multiple inferences is effective for ResNet-50, but not for EfficientNet due to its lower parameter count. As such, these workloads present a significant challenge to architects, since provisioning greater memory bandwidth can result in exorbitant incremental costs. However, by aggressively fusing the entire MBConv [40] block, we are able to achieve an operational intensity greater than 200 FLOPS/B. Furthermore, there is still considerable headroom in operational intensity, as evidenced by the performance of the ideal case where all of the model weights are pinned [9] and only the input image and final output results require memory accesses. These insights, along with the ephemeral nature of activations in inference workloads, motivated the aggressive fusion strategies described in Section 4.5. Enabling automatic exploration of fusion strategies greatly improved the efficacy of our architecture search technique, allowing joint optimization of efficient hardware architectures and the manner in which computation is mapped onto the architecture.

Aggressive op fusion and weight pinning come at the cost of significant on-chip storage capacity, as shown in Table 1. We define an op's *working set* size as the size of its input activations and outputs for a given batch size, and we define a model's working set size as the working set size of its largest op. Since working sets scale linearly with batch size, fusion tends to perform better at smaller batch sizes since more tensors will fit into SRAM. However, larger batch sizes can improve systolic array utilization, resulting in higher overall performance. Determining the most favorable resource allocation between compute and memory depends on the specific working set and weight storage sizes for a target workload, which we address through FAST.

#### 3.3.2 Scheduling and Resource Utilization

Figure 4: EfficientNet-B7 per-layer performance as a fraction of peak FLOPS on TPU-v3 (x-axis: EfficientNet-B7 layer number). Earlier layers have low utilization due to having few channels. A good utilization ratio should exceed 0.7. Smaller EfficientNets have worse utilization due to having fewer channels.

| Op Type               | FLOP Percentage | Runtime Percentage |
|-----------------------|-----------------|--------------------|
| DepthwiseConv2dNative | 5.00%           | 65.30%             |
| Conv2D                | 94.67%          | 34.20%             |
| Other                 | 0.33%           | 0.50%              |

Table 2: EfficientNet-B7 per-op performance as a fraction of total execution time on TPU-v3. Depthwise convolutions consume the majority of execution time, due to their poor mapping efficiency on TPU-v3.

To demonstrate the impact of how operations are mapped onto accelerators, we profiled EfficientNet-B7 performance on TPU-v3. Figure 4 shows the performance of each MBConv block in the model as a fraction of peak TPU-v3 compute (FLOPS). Initial layers have poor utilization, with utilization improving as the number of input/output channels increases. Overall TPU-v3 utilization on EfficientNet-B7 is only 14.8%, suggesting a potential 6.75x performance upside with a better-designed architecture and mapper with similar peak FLOPS capable of reaching full utilization.

To identify the cause of low average utilization, we examined EfficientNet-B7 operation performance as a fraction of total execution time on TPU-v3 as shown in Table 2. The culprit is clear: depthwise convolutions comprise the majority of overall runtime, but only utilize a small fraction of total compute. An accelerator design that balanced depthwise convolution performance with regular convolution performance would therefore see significant speedups on EfficientNet. We now discuss how this can automatically be achieved through the use of FAST.
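The numbers in this analysis can be cross-checked with simple arithmetic. The sketch below is an illustrative calculation only, with hypothetical function names: it derives the upside implied by 14.8% utilization and compares how much of the model's FLOPS each op type completes per unit of runtime, using the percentages reported in Table 2.

```python
# (FLOP percentage, runtime percentage) per op type, as reported in Table 2.
TABLE_2 = {
    "DepthwiseConv2dNative": (5.00, 65.30),
    "Conv2D": (94.67, 34.20),
    "Other": (0.33, 0.50),
}

def upside_from_utilization(utilization):
    """Speedup available if the same peak FLOPS were fully utilized."""
    return 1.0 / utilization

def relative_throughput(op_type):
    """Share of model FLOPS completed per share of runtime for one op type."""
    flop_pct, runtime_pct = TABLE_2[op_type]
    return flop_pct / runtime_pct

upside = upside_from_utilization(0.148)
conv_vs_depthwise = (
    relative_throughput("Conv2D") / relative_throughput("DepthwiseConv2dNative")
)
```

Under these figures, standard convolutions sustain roughly 36 times the effective throughput of depthwise convolutions on this workload, which is consistent with depthwise ops dominating runtime despite their small FLOP share.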

# 4. Methodology

FAST is a full-stack accelerator search technique for automatically designing custom accelerators optimized for a given set of ML workloads and subject to a set of constraints as shown in Figure 2. The search technique will be described in detail in the following sections.

### 4.1. Problem Definition

Our objective is to find an optimized set of hyperparameters h for the hardware datapath, scheduler, and op fusion, given user-defined workloads w and an objective function f (i.e. minimizing any function of power, area, and latency/throughput), subject to cost constraints (e.g. maximum area a or thermal design power p). Concretely, our optimization problem may be described by:

$$\min_{h,w} f(h,w) \tag{1}$$

$$\text{s.t.} \quad Area(h) \le a, \quad TDP(h) \le p, \tag{2}$$

$$ScheduleFailures(h, w) = 0. \tag{3}$$



Figure 5: Accelerator datapath configuration. PEs are connected by a mesh on-chip network. PE systolic arrays perform a matrix-vector multiply each cycle. Vector and scalar PEs can be modeled by setting systolic array X and/or Y dims to 1. L2 and Global memory structures are optional.

The constraint ScheduleFailures(h, w) = 0 ensures that workload w can be successfully mapped onto the architecture described by the hyperparameters h.

### 4.2. Overview of Our Framework

As shown in Figure 2, to address this problem, FAST first explores the hardware datapath with a black box optimizer (Google Vizier [14]), iteratively proposing new choices of hyperparameters h that define candidate architectures. Our architectural simulator, described in Section 5.1, then simulates the execution of target workloads on the candidate architecture. Compute-intensive ops such as Conv2D are optimized via pre-processing passes, such as tensor padding optimization, before calling Timeloop [35] to determine the best schedule and predicted op performance. The compute and memory access statistics for each op are then passed to FAST fusion, which outputs final execution time and power for the target workloads. These results are then returned to Vizier which proposes the next set of hyperparameters.
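The outer loop described above (propose hyperparameters, simulate, check constraints, keep the best feasible design) can be sketched in a few lines. Everything in this example is a stand-in: seeded random sampling replaces Vizier, `toy_simulate` is a made-up cost model rather than the architectural simulator, and the two hyperparameters `pes` and `systolic` are hypothetical simplifications of the real datapath search space.

```python
import random
from dataclasses import dataclass

POWERS = [2 ** i for i in range(9)]  # 1 to 256, powers of 2

@dataclass
class SimResult:
    latency: float
    area: float
    tdp: float
    schedule_failures: int

def toy_simulate(h):
    """Made-up cost model standing in for the simulator, scheduler, and fusion pass."""
    macs = h["pes"] * h["systolic"] ** 2
    return SimResult(
        latency=1e6 / macs + 0.01 * h["systolic"],
        area=0.002 * macs,
        tdp=0.001 * macs,
        schedule_failures=int(h["systolic"] > 128 and h["pes"] > 64),
    )

def is_feasible(result, max_area, max_tdp):
    """Area and TDP limits plus the requirement of zero schedule failures."""
    return (
        result.area <= max_area
        and result.tdp <= max_tdp
        and result.schedule_failures == 0
    )

def search(trials, max_area, max_tdp, seed=0):
    """Random search over h, keeping the lowest-latency feasible design."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        h = {"pes": rng.choice(POWERS), "systolic": rng.choice(POWERS)}
        result = toy_simulate(h)
        if is_feasible(result, max_area, max_tdp) and (
            best is None or result.latency < best[1].latency
        ):
            best = (h, result)
    return best

best = search(trials=200, max_area=100.0, max_tdp=50.0)
```

A black box optimizer differs from this loop mainly in how the next candidate is proposed: it uses the history of evaluated points instead of sampling uniformly.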

### 4.3. Safe Exploration for Datapath Optimization

Many real-world problems such as architecture search can be formulated as optimizing for a set of objectives, while adhering to safety constraints (e.g. requirements on TDP and area, and avoiding inputs for which our simulator cannot produce a valid mapping). Vizier supports Bayesian optimization and other optimization methods (e.g. linear combination search (LCS) [13]) in settings with arbitrary constraints that may not be known in advance [12]. Armed with smooth probabilistic constraints (e.g. how close it came to violating a given constraint), Vizier can effectively avoid unsafe regions, while quickly exploring safe regions. Compared to scheduling and fusion, datapath optimization has a large and highly non-convex search space with expensive cost function evaluation (up to 2 hours/sample), making it well-suited to a black box optimizer such as Vizier.

### 4.4. Architectural Search Parameters

| Parameter Name              | Type | Potential Values          |
|-----------------------------|------|---------------------------|
| PEs_x_dim                   | int  | 1 to 256, powers of 2     |
| PEs_y_dim                   | int  | 1 to 256, powers of 2     |
| Systolic_array_x            | int  | 1 to 256, powers of 2     |
| Systolic_array_y            | int  | 1 to 256, powers of 2     |
| L1_buffer_config            | enum | Private, Shared           |
| L1_input_buffer_size        | int  | 1KB to 1MB, powers of 2   |
| L1_weight_buffer_size       | int  | 1KB to 1MB, powers of 2   |
| L1_output_buffer_size       | int  | 1KB to 1MB, powers of 2   |
| L1_total_buffer_size        | int  | 1KB to 1MB, powers of 2   |
| L2_buffer_config            | enum | Disabled, Private, Shared |
| L2_input_buffer_multiplier  | int  | 1x to 128x, powers of 2   |
| L2_weight_buffer_multiplier | int  | 1x to 128x, powers of 2   |
| L2_output_buffer_multiplier | int  | 1x to 128x, powers of 2   |
| L2_total_buffer_multiplier  | int  | 1x to 128x, powers of 2   |
| L3_global_buffer_size       | int  | 0MB to 256MB, powers of 2 |
| GDDR6_channels              | int  | 1 to 8, powers of 2       |
| Native_batch_size           | int  | 1 to 256, powers of 2     |

Table 3: Accelerator datapath search space with $10^{14}$ possible values. When combined with scheduling and op fusion search spaces, the FAST total search space exceeds $10^{2300}$. Other memory technologies can easily be modeled.

As shown in Figure 5, we target a highly-parameterized and general ML accelerator template capable of modeling a wide range of previously proposed architectural designs. Unlike prior work which targets specific families of accelerators, we enlarged our datapath search space to cover an approximate superset of popular accelerator families based on grids of processing elements (PEs), as described in Table 3. The TPU family of accelerators instantiates large systolic arrays coupled with two levels of shared memory. This can be represented in our framework by setting the systolic array dimensions to the appropriate values, setting L1\_buffer\_config to Shared, and L2\_buffer\_config to Disabled. Many accelerators such as Eyeriss [6] use flexible scalar PEs with per-PE buffers for input activations, weights, and output activations. This design can be reached by setting systolic array X and Y dimensions to 1, and L1\_buffer\_config to Private. Several edge accelerators proposed in industry such as Simba [41] and EdgeTPU [56] use vector PEs, which can be represented by setting the systolic array X dimension to 1. While our datapath search space cannot perfectly cover all possible designs, it is still significantly larger than those used in previous work [43, 46, 49]. We plan to further extend the search space in future work.
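The size of the datapath search space in Table 3 can be approximated by multiplying the number of values each parameter may take. The sketch below does this under two assumptions that are not stated in the table: all parameters are treated as independent, and the 0MB option for the global buffer is counted as one extra value alongside the powers of 2. The preset dictionaries mirror the three architecture families described above; the 128x128 systolic array for the TPU-like preset is an illustrative choice.

```python
from math import prod

def powers_of_two(lo, hi):
    """Inclusive list of powers of two between lo and hi."""
    values, v = [], lo
    while v <= hi:
        values.append(v)
        v *= 2
    return values

SEARCH_SPACE = {
    "PEs_x_dim": powers_of_two(1, 256),
    "PEs_y_dim": powers_of_two(1, 256),
    "Systolic_array_x": powers_of_two(1, 256),
    "Systolic_array_y": powers_of_two(1, 256),
    "L1_buffer_config": ["Private", "Shared"],
    "L2_buffer_config": ["Disabled", "Private", "Shared"],
    "L3_global_buffer_size": [0] + powers_of_two(1, 256),  # in MB
    "GDDR6_channels": powers_of_two(1, 8),
    "Native_batch_size": powers_of_two(1, 256),
}
for kind in ("input", "weight", "output", "total"):
    SEARCH_SPACE[f"L1_{kind}_buffer_size"] = powers_of_two(1, 1024)  # in KB
    SEARCH_SPACE[f"L2_{kind}_buffer_multiplier"] = powers_of_two(1, 128)

# Naive count: product of the number of choices per parameter.
naive_size = prod(len(values) for values in SEARCH_SPACE.values())

# Partial configurations for the architecture families named in the text.
PRESETS = {
    "tpu_like": {"Systolic_array_x": 128, "Systolic_array_y": 128,
                 "L1_buffer_config": "Shared", "L2_buffer_config": "Disabled"},
    "scalar_pe": {"Systolic_array_x": 1, "Systolic_array_y": 1,
                  "L1_buffer_config": "Private"},
    "vector_pe": {"Systolic_array_x": 1},
}

def is_valid(config):
    """True if every setting in a (partial) configuration lies in the search space."""
    return all(value in SEARCH_SPACE[name] for name, value in config.items())
```

With these assumptions the naive product falls between $10^{14}$ and $10^{15}$, in line with the order of magnitude given in the Table 3 caption.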

### 4.5. FAST Fusion

As discussed in Section 3.3.1, modern models such as EfficientNet demand minimization of DRAM accesses due to their poor inherent operational intensities. We developed a technique to automatically optimize DRAM accesses through strategic utilization of the SRAM-based Global Memory shown in Figure 5, which offers significantly higher access bandwidth. This technique, which we call *FAST fusion*, extends traditional op fusion to allow multiple layers to be fused together; given sufficient Global Memory capacity, the algorithm is flexible enough to allow the entire graph to be fused. FAST fusion ensures that some combination of input activations, output activations, and weights of memory-bound layers are resident in on-chip SRAM, leading to performance improvements.

Given that Global Memory is limited, we are presented with a constrained optimization problem; fortunately however, this may be expressed as an integer linear program. We are given an input graph G(V, E) representing an *n*-layer, partially fused<sup>3</sup> neural network which we wish to optimize, where each vertex $v \in V$ is a layer of the network, while each edge e = (u, v) represents an activation dependency from layer u to v (that is, the output activation of u is an input to v). Let $F_{in}(v)$ and $F_{out}(v)$ represent the fan-in and fan-out sets, respectively, of some vertex $v \in V$. We assume that G has the property that while $0 \le |F_{out}(v)| \le n-1$, $0 \le |F_{in}(v)| \le 1$. To simplify notation, let $D_t := \{I, O, W\}$ represent the set of data types used to annotate variables, where I, O, and W represent input activations, output activations, and weights, respectively. Given a known execution order $o: v \in V \longrightarrow \{0, \ldots, n-1\}$ for each network layer, we may express the optimization problem as follows:

$$\begin{array}{lll}
\min\limits_{p_i^k} & \sum_{i \in V} T_i & \\
\text{s.t.} & T_i \geq T_i^{min} & \\
& T_i \geq T_i^{max} - \sum_{k \in D_t} t_i^k \cdot p_i^k & \\
& C_{GM} \geq B_i + \sum_{k \in D_t} d_i^k \cdot p_i^k + \sum_{j \in V, j \neq i} W_j \cdot p_j^W & \\
& p_i^O \geq p_j^I & \forall j \in F_{out}(i) \\
& \sum_{j \in F_{out}(i)} p_j^I \geq p_i^O & \\
& M \cdot (1 - p_i^I) \geq o(i) - o(F_{in}(i)) - 1 & \\
& p_i^k \in \{0, 1\} & \forall k \in D_t
\end{array} \tag{4}$$

<sup>3</sup>That is, G(V, E) is derived from an original m-layer network that has been optimized such that combinations of data formatting, element-wise, and matrix operations have been grouped in fused computations [22].

The variable $p_i^k$ is a binary decision variable indicating whether the tensor of type $k \in D_t$ for layer *i* is to be placed in the Global Memory (if equal to 1), while the variable $T_i$ represents the optimized execution time for layer *i* as a function of $p_i^k$. $T_i^{min}$ and $T_i^{max}$ are the execution times for layer *i* when the inputs and outputs of the layer are pinned exclusively in the Global Memory and DRAM, respectively (these are obtained from Timeloop evaluation of the layer). The parameter $t_i^k$ is the time to access layer *i*'s tensor of type k (where $k \in D_t$) from DRAM, $C_{GM}$ is the capacity of the Global Memory in bytes, $B_i$ is the nominal global buffer usage of layer *i*, $d_i^k$ is the difference between the size of layer *i*'s tensor of type k and the corresponding tile size allocated on the global buffer if we were to assume the tensor is being streamed from/to DRAM, $W_j$ is the size of layer *j*'s weight tensor, and $M \ge n-1$ is an arbitrarily large constant. Note that the constraints imply that activations are only stored in the global buffer if the op consuming an activation executes immediately after the op which produces the activation. This also means that in cases where a node has multi-fanout (e.g., skip connections), at most one op in the fanout cone will benefit from reading its input activation from global memory. These constraints, which limit the maximum potential upside of the technique, were imposed because of some limitations in our simulation infrastructure. Future work will address these limitations, thereby potentially allowing for further performance gains.
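To make the structure of the fusion problem more tangible, the sketch below solves a tiny instance by exhaustive enumeration instead of an ILP solver. It is a simplified illustration under several assumptions: the objective is taken to be the sum of per-layer times, graph inputs (layers with no fan-in) are always read from DRAM, all numeric values are invented, and names such as `fuse`, `w_size`, and the dictionary layout are hypothetical. Enumeration is only practical for a handful of layers; it is used here so the example needs no external solver.

```python
from itertools import product

KINDS = ("I", "O", "W")  # input activations, output activations, weights

def feasible(p, layers, fan_in, fan_out, order, capacity):
    """Check the placement p against the capacity, fan-out, and adjacency rules."""
    n = len(layers)
    big_m = n  # any constant M >= n - 1 works
    for i, lyr in enumerate(layers):
        # Global Memory in use while layer i runs, including other layers' pinned weights.
        used = lyr["B"] + sum(lyr["d"][k] * p[i][k] for k in KINDS)
        used += sum(layers[j]["w_size"] * p[j]["W"] for j in range(n) if j != i)
        if used > capacity:
            return False
        # A consumer reads its input on chip only if the producer kept its output there.
        if any(p[i]["O"] < p[j]["I"] for j in fan_out[i]):
            return False
        # An output stays on chip only if at least one consumer reads it there.
        if sum(p[j]["I"] for j in fan_out[i]) < p[i]["O"]:
            return False
        if fan_in[i] is None:
            # Assumption of this sketch: graph inputs always come from DRAM.
            if p[i]["I"]:
                return False
        elif big_m * (1 - p[i]["I"]) < order[i] - order[fan_in[i]] - 1:
            # On-chip input requires the producer to run immediately before.
            return False
    return True

def fuse(layers, fan_in, order, capacity):
    """Return (total_time, placement) minimizing the summed layer times."""
    n = len(layers)
    fan_out = [[j for j in range(n) if fan_in[j] == i] for i in range(n)]
    best = None
    for bits in product((0, 1), repeat=3 * n):
        p = [dict(zip(KINDS, bits[3 * i:3 * i + 3])) for i in range(n)]
        if not feasible(p, layers, fan_in, fan_out, order, capacity):
            continue
        total = sum(
            max(lyr["t_min"], lyr["t_max"] - sum(lyr["t"][k] * p[i][k] for k in KINDS))
            for i, lyr in enumerate(layers)
        )
        if best is None or total < best[0]:
            best = (total, p)
    return best

# Invented per-layer parameters, identical for all three layers.
template = {"t_min": 4, "t_max": 10, "t": {"I": 3, "O": 3, "W": 2},
            "B": 1, "d": {"I": 2, "O": 2, "W": 1}, "w_size": 4}
layers = [dict(template) for _ in range(3)]
fan_in = [None, 0, 0]   # layer 0 feeds both layer 1 and layer 2
order = [0, 1, 2]

fused_time, placement = fuse(layers, fan_in, order, capacity=100)
unfused_time, _ = fuse(layers, fan_in, order, capacity=1)
```

In this toy instance layer 2 cannot read its input from Global Memory because layer 1 executes between it and its producer, which illustrates the multi-fanout limitation noted above.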

# 5. Evaluation

# 5.1. Experimental Setup

Architectural Performance Simulator: We modified a production ML accelerator performance simulator to enable modeling of a wide range of architectures as described in Section 4.4. Our simulator takes unmodified XLA HLO graphs [15] as input and employs Timeloop [35] to evaluate the performance of Conv2D, DepthwiseConv2D, and MatMul operations. Since Timeloop cannot handle problem dimensions that do not factorize cleanly into hardware datapath dimensions, our simulator performs a padding preprocessing step to improve op utilization. All other HLO ops are modeled using our simulator's default cost models since they are a poor fit for Timeloop. To reduce runtimes, we constrain the Timeloop spatial search space according to the scheduling heuristic selected by Vizier, such as weight-stationary or output-stationary. When a simulated datapath design point results in any number of Timeloop scheduling failures, the simulator result and datapath design point are both considered invalid. To estimate candidate architecture area and power consumption, we built analytical models correlated to production designs on an industry sub-10-nm manufacturing process. FAST fusion's ILP is solved using SCIP v7.0.1 [11], configured with a 20-minute time-out; if an optimal solution is not found in that time, the solver returns the best incumbent solution.
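As a rough illustration of the padding step, the sketch below rounds each problem dimension up to the next multiple of the corresponding datapath dimension and reports the fraction of the padded work that is useful. The rounding rule, the function names, and the example dimensions are assumptions made for illustration; the simulator's actual padding policy may differ.

```python
import math


def pad_to_multiple(dim: int, tile: int) -> int:
    """Round dim up to the nearest multiple of tile."""
    return (dim + tile - 1) // tile * tile


def padded_utilization(dims, tiles) -> float:
    """Fraction of the padded problem that corresponds to useful work."""
    useful = math.prod(dims)
    padded = math.prod(pad_to_multiple(d, t) for d, t in zip(dims, tiles))
    return useful / padded


# Hypothetical layer with 96 input and 96 output channels.
channels = (96, 96)
print(padded_utilization(channels, (128, 128)))  # mapped onto a 128x128 array
print(padded_utilization(channels, (32, 32)))    # mapped onto a 32x32 array
```

Under these assumptions the 96-channel layer wastes a large share of a 128x128 array but maps onto a 32x32 array without any padding, which is the kind of effect that favors smaller systolic arrays for narrow layers.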

Workloads: Although this paper focuses on EfficientNet performance, it is important to consider how well our technique generalizes to other prominent computer vision workloads. In addition to EfficientNet, we evaluate on ResNet50v2 [18], one of the most popular CNN-based models. We also consider two components of a production OCR pipeline described in [36]. OCR-RPN is the first stage in a standard Mask R-CNN implementation used to propose candidate text regions of interest. OCR-Recognizer is an LSTM-based model within the recognizer pipeline. These workloads were selected based on their varying performance characteristics on TPU-v3. As discussed in Section 3.3.2, EfficientNets currently run less efficiently on TPUs due to their use of depthwise-separable convolutions. ResNet50v2 runs much more efficiently than EfficientNet because it uses standard Conv2D operations. As production models, OCR-RPN and OCR-Recognizer are already optimized for the TPU architecture and thus will benefit less from our technique.

Optimization framework: As described in Section 4.3, we use Google Vizier [14] with LCS optimization [13] enabled, running 5000 trials per experiment.
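The overall shape of such a black-box search loop can be sketched as follows. This sketch uses plain random sampling in place of Vizier's LCS optimizer, and the search space, area budget, and `evaluate` cost model are all hypothetical stand-ins for the real simulator. The one behavior it borrows from the setup above is that invalid design points are discarded.

```python
import random

# Hypothetical datapath search space (not the space used in the paper).
SEARCH_SPACE = {
    "pe_dim": [16, 32, 64, 128],
    "num_pes": [1, 2, 4, 8, 16, 32, 64],
    "gm_mib": [8, 16, 32, 64, 128],
}
AREA_BUDGET = 1.0  # normalized, hypothetical


def evaluate(design):
    """Toy stand-in for the simulator: returns Perf/TDP, or None if invalid."""
    macs = design["pe_dim"] ** 2 * design["num_pes"]
    area = macs / 65536 * 0.6 + design["gm_mib"] / 128 * 0.3
    if area > AREA_BUDGET:
        return None  # design exceeds the budget and is treated as invalid
    utilization = min(1.0, 32 / design["pe_dim"])  # toy utilization model
    perf = macs * utilization
    tdp = 0.1 + area
    return perf / tdp


def random_search(num_trials, seed=0):
    rng = random.Random(seed)
    best_design, best_score = None, float("-inf")
    for _ in range(num_trials):
        design = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        score = evaluate(design)
        if score is None:
            continue  # skip invalid trials
        if score > best_score:
            best_design, best_score = design, score
    return best_design, best_score


best_design, best_score = random_search(num_trials=200)
print(best_design, best_score)
```

A production optimizer replaces the random sampler with a strategy that learns from earlier trials, but the evaluate-and-keep-best structure stays the same.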

# **5.2. Experimental Results**

### 5.2.1 Overall Speedup

Figure 6 shows FAST's overall performance improvement on each workload relative to a TPU-v3 baseline, in which performance is measured in processed inference queries per second (QPS). FAST is given a power and area budget similar to the current-generation TPU-v3, but on a new process technology, emulating the methodology used by accelerator architects to design next-generation products. We evaluate FAST tuned for individual workloads as well as across multiple workloads. Our multi-workload optimized FAST is evaluated based on the geometric mean across EfficientNet-B7, ResNet50v2, OCR-RPN, and OCR-Recognizer, achieving a 2.9x speedup over the TPU-v3 baseline. Overall speedups are much higher on EfficientNets due to their use of depthwise-separable convolutions. When provided with pure performance as the objective, FAST finds large designs that come close to our maximum area and TDP constraints. The production OCR workloads are already well-optimized for TPU-v3 and thus have the lowest gains. Using FAST-specified scheduling and fusion on the TPU-v3 datapath provides a substantial 2x speedup; however, this is optimistic since implementing the generated schedules and achieving the projected speedup on existing hardware may require hardware changes. Tuning an architecture across multiple workloads results in slightly lower, but still substantial, improvements over the baseline. When using full FAST search optimizing for each specific workload, we achieve a 5.6x average speedup.
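For reference, the geometric mean used to aggregate per-workload speedups can be computed as in the sketch below. The speedup values are hypothetical placeholders and are not results from Figure 6.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-workload speedups, for illustration only.
speedups = {"workload_a": 1.5, "workload_b": 2.0, "workload_c": 4.0, "workload_d": 6.0}
print(geometric_mean(list(speedups.values())))
```

Unlike the arithmetic mean, the geometric mean is not dominated by a single workload with an unusually large speedup, which is why it is commonly used to aggregate ratios.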

Absolute performance numbers can be misleading since different hardware designs vary in cost. A common metric for evaluating datacenter accelerators is to normalize for these differences by considering performance per TCO (Total Cost of Ownership), which includes initial capital expenses and recurring operating costs such as electricity. While TCO numbers are not published because of the sensitive nature of the data, we can use Thermal Design Power (TDP) as a proxy for TCO since the two are highly correlated [23]. Figure 7 shows Perf/TDP numbers relative to a hypothetical TPU-v3 built on the same sub-10-nm process technology. When optimizing for Perf/TDP, FAST finds balanced designs that are smaller than our maximum area and TDP constraints, but achieve high compute utilization with minimal memory bandwidth bottlenecks. Overall, FAST individually optimized for each workload improves Perf/TDP on average by 4x across all workloads and 2.85x on the reduced workload suite, whereas FAST optimized for multiple workloads still improves Perf/TDP by 2.35x.

#### 5.2.2 Search Convergence Rate

We evaluated several black-box optimizer heuristics provided by Vizier. In Figure 8, we compare the convergence rate of Vizier's default Bayesian algorithm against Linear Combination Swarm (LCS) [13] and random sampling when optimizing for Perf/TDP on EfficientNet-B7. For each heuristic, we show the mean and 90% confidence interval across 5 runs. LCS outperforms the other heuristics once the number of trials exceeds 2000.

Figure 6: Modeled inference throughput relative to TPU-v3. FAST's SW-only optimization, including scheduling and op fusion techniques, offers large speedups over existing TPU-v3 hardware. Speedups are much larger when FAST is also allowed to search over the datapath. Average results correspond to the geometric mean of speedup across all workloads.

Figure 7: Modeled inference throughput per TDP (peak power draw) relative to TPU-v3, normalized to the same manufacturing process technology. FAST demonstrates large Perf/TDP wins across all workloads.

Figure 8: Search convergence rate on EfficientNet-B7 for Bayesian, random, and Linear Combination Swarm [13].

Figure 9: EfficientNet-B7 performance vs. TDP relative to TPU-v3 on the same process technology. Performance is defined in terms of step time, i.e. inverse throughput.

#### 5.2.3 Pareto Frontier

To evaluate our search space coverage, in Figure 9 we characterize the relationship between TDP and performance on EfficientNet-B7. Both axes are normalized to a hypothetical TPU-v3 built with the same sub-10nm process technology, which sits at (1.0, 1.0). Because performance is plotted as step time, points located towards the bottom left are better, and the Pareto-optimal designs lie along that edge. FAST is able to find many designs significantly better than the baseline.

|                        | Modeled TPU-v3 | FAST-Large | FAST-Small |
|------------------------|----------------|------------|------------|
| Normalized TDP         | 0.5x           | 0.4x       | 0.15x      |
| Normalized Area        | 0.6x           | 0.7x       | 0.3x       |
| Peak Compute           | 123 TFLOPS     | 131 TFLOPS | 32 TFLOPS  |
| Peak Bandwidth         | 900 GB/s       | 448 GB/s   | 448 GB/s   |
| PE Dimensions          | 128x128        | 32x32      | 64x32      |
| Num PEs                | 2x2            | 64         | 8          |
| Global Buffer Size     | 2x16 MiB       | 128 MiB    | 8 MiB      |
| Batch Size             | 2x64           | 8          | 64         |
| Compute Utilization    | 0.14           | 0.61       | 0.74       |
| Pre-fusion Mem Stall % |                | 63%        | 21%        |
| Fusion Efficiency      |                | 85%        | 0%         |
| OpInt Ridgepoint       | 137            | 292        | 73         |
| Fused Model OpInt      | 63             | 383        | 63         |
| Performance (QPS)      | 210            | 733        | 241        |
| Perf/TDP               | 1              | 3.9        | 3.9        |

Table 4: Two example designs found by FAST optimized for EfficientNet-B7 with similar overall Perf/TDP. Area and power are normalized to threshold constraints. TPU-v3 contains two TensorCores treated as two separate accelerators.
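The notion of Pareto optimality used in the discussion of Figure 9 can be sketched as a simple dominance filter over (TDP, step time) pairs, where lower is better on both axes. The design points below are hypothetical and are not taken from our search results.

```python
def pareto_front(points):
    """Return the points not dominated by any other point (minimize both axes)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (normalized TDP, normalized step time) pairs; (1.0, 1.0) is the baseline.
designs = [(1.0, 1.0), (0.5, 0.6), (0.3, 0.9), (0.8, 0.4), (0.9, 0.9), (0.5, 0.7)]
print(pareto_front(designs))
```

This quadratic-time filter is adequate for a few thousand trials; larger point sets would benefit from sorting by one axis first.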

### 5.2.4 Example Designs Found by FAST

Table 4 shows two example designs found with FAST when optimizing Perf/TDP on EfficientNet-B7, compared to TPU-v3 normalized to the same sub-10nm process technology. To improve mapping efficiency for depthwise-separable convolutions, both designs have PEs with smaller systolic array dimensions, resulting in significantly higher compute utilization. Although FAST-Large has peak compute similar to that of TPU-v3 with only half the peak memory bandwidth, the design is not bandwidth-bottlenecked: its 128 MiB Global Buffer allows FAST fusion to improve operational intensity from 63 to 383. Overall, idle time spent waiting for DRAM transfers to complete is reduced by 85%, from 63% pre-fusion to 9% post-fusion. The FAST-Small design gives up on fusion entirely, instead opting for a smaller design with a significantly lower compute-to-memory-bandwidth ratio.

Figure 10: FAST-Large post-fusion operational intensity on EfficientNet-B7, sweeping Global Memory and batch size. Due to the large model size, fusion is only effective at lower batch sizes with large global buffers.

Figure 11: Performance breakdown of each component of FAST relative to a TPU-v3 single TensorCore baseline. Improvements are cumulative; for example, FAST fusion includes both datapath and scheduling improvements. Datapath improvements without fusion are less effective due to the memory bottleneck, which FAST fusion can address.

### 5.2.5 Evaluating FAST Fusion

We evaluate FAST fusion performance by measuring its impact on operational intensity as we sweep Global Memory and batch size in an otherwise fixed FAST-Large design. Note that increasing batch size also increases activation size (see Table 1), reducing fusion efficiency as fewer tensors can be kept in on-chip memory. Therefore, the largest EfficientNet model (B7) represents a worst-case scenario for FAST fusion. Nonetheless, given a combination of smaller batch size and larger Global Memory capacity, FAST fusion can still achieve sufficiently high operational intensity to overcome the memory bottleneck.
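The relationship between operational intensity and the memory bottleneck can be illustrated with a simple roofline calculation. The sketch below uses the rounded peak compute, peak bandwidth, and fused operational intensity values from Table 4, and assumes operational intensity is expressed in FLOPs per byte. Because the inputs are rounded, computed ridge points may differ slightly from the table entries.

```python
def ridge_point(peak_tflops, peak_gbps):
    """Operational intensity (FLOPs/byte) at which compute and bandwidth balance."""
    return peak_tflops * 1e12 / (peak_gbps * 1e9)


def attainable_tflops(peak_tflops, peak_gbps, op_intensity):
    """Roofline bound: min(peak compute, op_intensity * peak bandwidth)."""
    return min(peak_tflops, op_intensity * peak_gbps * 1e9 / 1e12)


# (peak TFLOPS, peak GB/s, fused model operational intensity) from Table 4.
DESIGNS = {
    "Modeled TPU-v3": (123, 900, 63),
    "FAST-Large": (131, 448, 383),
    "FAST-Small": (32, 448, 63),
}

for name, (tflops, gbps, op_int) in DESIGNS.items():
    ridge = ridge_point(tflops, gbps)
    bound = "compute-bound" if op_int >= ridge else "memory-bound"
    print(name, round(ridge), bound, attainable_tflops(tflops, gbps, op_int))
```

Under this simplified model, only FAST-Large has a fused operational intensity above its ridge point, which is consistent with the observation that fusion lets it escape the memory bottleneck.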

### 5.2.6 Ablation Study

To clearly attribute performance gains to each component of our framework, in Figure 11 we show the contributions of improved scheduling, datapath improvements, and FAST fusion. These improvements are cumulative; for example, datapath improvements also include scheduling improvements. It is important to note the interconnected and synergistic nature of these components. For the datapath, we modified the TPU-v3 systolic arrays from 128x128 to 32x32, increased the number of PEs to match TPU-v3 peak performance, and increased the Global Memory size from 16 MiB to 128 MiB. Datapath improvements without fusion result in significantly lower speedups since performance is a function of both compute and memory, and increasing systolic array utilization yields no further improvement once the memory bandwidth limit is reached. Raising this bandwidth limit with FAST fusion allows the improved datapath to realize its utilization improvements. Likewise, enabling FAST fusion without the aforementioned architectural changes results in less than 5% speedup (not shown) due to the small 16 MiB Global Memory and poor 128x128 systolic array utilization. By jointly optimizing scheduling, datapath, and fusion, FAST enables significantly higher speedups compared to prior work.

### 5.2.7 Results Discussion

While our projected performance results are highly compelling, the optimized designs found through FAST do not support the full generality and feature set of designs like TPU-v3, which is optimized not just for inference but also for training across thousands of devices. However, key production datacenter workloads may be sufficiently important or provide sufficient volume for substantial returns on investment. Specialized designs optimized for small sets of workloads are unlikely to completely replace general-purpose designs, but may still serve an important niche in production environments. By substantially enlarging the set of workloads, FAST may also be used to propose the design of future generations of general-purpose ML accelerators.

### 6. Conclusion

We presented FAST, a full-stack accelerator search technique that performs joint optimization of the hardware datapath, software scheduling, and op fusion. Specialized accelerator designs discovered by FAST achieve on average 4x better Perf/TDP on state-of-the-art vision models compared to a general-purpose TPU-v3 accelerator baseline.

## References

- [1] Mohamed S. Abdelfattah, Lukasz Dudziak, Thomas Chau, Royson Lee, Hyeji Kim, and Nicholas D. Lane. Best of both worlds: Automl codesign of a cnn and its hardware accelerator. In *Proceedings of the 57th ACM/EDAC/IEEE Design Automation Conference*, DAC '20. IEEE Press, 2020.
- [2] Amirali Abdolrashidi, Qiumin Xu, Shibo Wang, Sudip Roy, and Yanqi Zhou. Learning to fuse. 2019.
- [3] M. Alwani, H. Chen, M. Ferdman, and P. Milder. Fused-layer cnn accelerators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–12, 2016.
- [4] Matthias Boehm, Berthold Reinwald, Dylan Hutchison, Prithviraj Sen, Alexandre V. Evfimievski, and Niketan Pansare. On optimizing operator fusion plans for large-scale machine learning in systemml. *Proc. VLDB Endow.*, 11(12):1755–1768, Aug. 2018.

- [5] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware, 2019.
- [6] Y. Chen, J. Emer, and V. Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. 43rd Annual International Symposium on Computer Architecture (ISCA), pages 367–379, 2016.
- [7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800–1807, 2017.
- [8] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky. Nvidia a100 tensor core gpu: Performance and innovation. *IEEE Micro*, pages 1–1, 2021.
- [9] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, M. Abeydeera, L. Adams, H. Angepat, C. Boehn, D. Chiou, O. Firestein, A. Forin, K. S. Gatlin, M. Ghandi, S. Heil, K. Holohan, A. El Husseini, T. Juhasz, K. Kagi, R. K. Kovvuri, S. Lanka, F. van Megen, D. Mukhortov, P. Patel, B. Perez, A. Rapsang, S. Reinhardt, B. Rouhani, A. Sapek, R. Seera, S. Shekar, B. Sridharan, G. Weisz, L. Woods, P. Yi Xiao, D. Zhang, R. Zhao, and D. Burger. Serving dnns in real time at datacenter scale with project brainwave. *IEEE Micro*, 38(2):8–20, 2018.
- [10] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam. Shidiannao: Shifting vision processing closer to the sensor. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pages 92–104, 2015.
- [11] Gerald Gamrath, Daniel Anderson, Ksenia Bestuzheva, Wei-Kun Chen, Leon Eifler, Maxime Gasse, Patrick Gemander, Ambros Gleixner, Leona Gottwald, Katrin Halbig, Gregor Hendel, Christopher Hojny, Thorsten Koch, Pierre Le Bodic, Stephen J. Maher, Frederic Matter, Matthias Miltenberger, Erik Mühmer, Benjamin Müller, Marc E. Pfetsch, Franziska Schlösser, Felipe Serrano, Yuji Shinano, Christine Tawfik, Stefan Vigerske, Fabian Wegscheider, Dieter Weninger, and Jakob Witzig. The SCIP Optimization Suite 7.0. Technical report, Optimization Online, March 2020.
- [12] Michael A. Gelbart, Jasper Snoek, and Ryan P. Adams. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI'14, page 250–259, Arlington, Virginia, USA, 2014. AUAI Press.
- [13] Daniel Golovin, Greg Kochanski, and John Elliot Karro. Black box optimization via a bayesian-optimized genetic algorithm. 2017.
- [14] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google vizier: A service for black-box optimization. In *Proceedings of the* 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1487–1495, 2017.
- [15] Google. XLA: Optimizing Compiler for TensorFlow, 2018.
- [16] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
- [17] Cong Hao, Xiaofan Zhang, Yuhong Li, Sitao Huang, Jinjun Xiong, Kyle Rupnow, Wen mei Hwu, and Deming Chen. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge. arXiv preprint arXiv:1904.04421, 2019.

- [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
- [19] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- [20] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
- [21] Weiwen Jiang, Lei Yang, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Shouzhen Gu, Yiyu Shi, and Jingtong Hu. Hardware/software co-exploration of neural architectures. *CoRR*, abs/1907.04650, 2019.
- [22] Norman P. Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson. A domain-specific supercomputer for training deep neural networks. *Commun. ACM*, 63(7):67–78, June 2020.
- [23] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit, 2017.
- [24] Sheng-Chun Kao, Geonhwa Jeong, and Tushar Krishna. Confuciux: Autonomous hardware resource assignment for dnn accelerators using reinforcement learning, 2020.
- [25] Samuel J. Kaufman, Phitchaya Mangpo Phothilimthana, Yanqi Zhou, and Mike Burrows. A Learned Performance Model for the Tensor Processing Unit. *ML for Systems Workshop at NeurIPS*, page arXiv:2008.01040, Aug. 2020.
- [26] Sun Yuan Kung. VLSI array processors. Prentice Hall, 1988.
- [27] I. Kuon and J. Rose. Measuring the gap between FPGAs and asics. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 26(2):203–215, 2007.
- [28] H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, and A. Parashar. Maestro: A data-centric approach to understand reuse, performance, and hardware cost of dnn mappings. *IEEE Micro*, 40(3):20–29, 2020.
- [29] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, pages 4013–4021, 2016.

- [30] Sheng Li, Mingxing Tan, Ruoming Pang, Andrew Li, Liqun Cheng, Quoc Le, and Norman P. Jouppi. Searching for fast model families on datacenter accelerators, 2021.
- [31] Yuhong Li, Cong Hao, Xiaofan Zhang, Xinheng Liu, Yao Chen, Jinjun Xiong, Wen mei Hwu, and Deming Chen. Edd: Efficient differentiable dnn architecture and implementation co-search for embedded ai solutions, 2020.
- [32] Guoping Long, Jun Yang, Kai Zhu, and Wei Lin. Fusionstitching: Deep fusion and code generation for tensorflow computations on gpus. arXiv preprint arXiv:1811.05213, 2018.
- [33] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li. Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 553–564, 2017.
- [34] C. Ma, X. Mu, and D. Sha. Multi-layers feature fusion of convolutional neural network for scene classification of remote sensing. *IEEE Access*, 7:121685–121694, 2019.
- [35] A. Parashar, P. Raina, Y. S. Shao, Y. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer. Timeloop: A systematic approach to dnn accelerator evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 304–315, 2019.
- [36] Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, and Ying Xiao. Towards unconstrained end-to-end text spotting. In *Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)*, October 2019.
- [37] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. PLDI '13, page 519–530, New York, NY, USA, 2013. Association for Computing Machinery.
- [38] B. Reagen, J. M. Hernández-Lobato, R. Adolf, M. Gelbart, P. Whatmough, G. Wei, and D. Brooks. A case for efficient accelerator design space exploration via bayesian optimization. In 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pages 1–6, 2017.
- [39] Jared Roesch, Steven Lyubomirsky, Marisa Kirisame, Logan Weber, Josh Pollock, Luis Vega, Ziheng Jiang, Tianqi Chen, Thierry Moreau, and Zachary Tatlock. Relay: A high-level compiler for deep learning, 2019.
- [40] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
- [41] Yakun Sophia Shao, Jason Clemons, Rangharajan Venkatesan, Brian Zimmer, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, Stephen G. Tell, Yanqing Zhang, William J. Dally, Joel Emer, C. Thomas Gray, Brucek Khailany, and Stephen W. Keckler. Simba: Scaling deep-learning inference with multichip-module-based architecture. In *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*, MICRO '52, page 14–27, New York, NY, USA, 2019. Association for Computing Machinery.
- [42] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh. From high-level deep neural models to FPGAs. In 2016 49th Annual *IEEE/ACM International Symposium on Microarchitecture* (*MICRO*), pages 1–12, 2016.

- [43] Zhan Shi, Chirag Sakhuja, Milad Hashemi, Kevin Swersky, and Calvin Lin. Learned hardware/software co-design of neural accelerators, 2020.
- [44] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2820–2828, 2019.
- [45] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In *International Conference on Machine Learning*, pages 6105–6114, 2019.
- [46] R. Venkatesan, Y. S. Shao, M. Wang, J. Clemons, S. Dai, M. Fojtik, B. Keller, A. Klinefelter, N. Pinckney, P. Raina, Y. Zhang, B. Zimmer, W. J. Dally, J. Emer, S. W. Keckler, and B. Khailany. Magnet: A modular accelerator generator for neural networks. In 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–8, 2019.
- [47] Wm A Wulf and Sally A McKee. Hitting the memory wall: Implications of the obvious. ACM SIGARCH computer architecture news, 23(1):20–24, 1995.
- [48] Lei Yang, Zheyu Yan, Meng Li, Hyoukjun Kwon, Weiwen Jiang, Liangzhen Lai, Yiyu Shi, Tushar Krishna, and Vikas Chandra. Co-exploration of neural architectures and heterogeneous asic accelerator designs targeting multiple tasks. In *Proceedings of the 57th ACM/EDAC/IEEE Design Automation Conference*, DAC '20. IEEE Press, 2020.
- [49] Xuan Yang, Mingyu Gao, Qiaoyi Liu, Jeff Setter, Jing Pu, Ankita Nayak, Steven Bell, Kaidi Cao, Heonjae Ha, Priyanka Raina, Christos Kozyrakis, and Mark Horowitz. Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators. In 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar 2020.
- [50] Amir Yazdanbakhsh, Christof Angermueller, Berkin Akin, Yanqi Zhou, Albin Jones, Milad Hashemi, Kevin Swersky, Satrajit Chatterjee, Ravi Narayanaswami, and James Laudon. Apollo: Transferable architecture exploration, 2021.
- [51] Amir Yazdanbakhsh, Kiran Seshadri, Berkin Akin, James Laudon, and Ravi Narayanaswami. An evaluation of edge tpu accelerators for convolutional neural networks, 2021.
- [52] Yujun Lin, Driss Hafdi, Kuan Wang, Zhijian Liu, and Song Han. Neural-hardware architecture search. In *NeurIPS ML for Systems Workshop*, 2019.
- [53] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In *Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, FPGA '15, page 161–170, New York, NY, USA, 2015. Association for Computing Machinery.
- [54] X. Zhang, W. Jiang, Y. Shi, and J. Hu. When neural architecture search meets hardware implementation: from hardware awareness to co-design. In 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 25–30, 2019.
- [55] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W. Hwu, and D. Chen. Dnnbuilder: an automated tool for building high-performance dnn hardware accelerators for FPGAs. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–8, 2018.

- [56] Yanqi Zhou, Xuanyi Dong, Berkin Akin, Mingxing Tan, Daiyi Peng, Tianjian Meng, Amir Yazdanbakhsh, Da Huang, Ravi Narayanaswami, and James Laudon. Rethinking codesign of neural architectures and hardware accelerators, 2021.
- [57] Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi, Daniel Wong, Peter Ma, Qiumin Xu, Hanxiao Liu, Phitchaya Phothilimtha, Shen Wang, Anna Goldie, Azalia Mirhoseini, and James Laudon. Transferable graph optimizers for ml compilers. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, *Advances in Neural Information Processing Systems*, volume 33, pages 13844–13855. Curran Associates, Inc., 2020.