
Occam: Optimal Data Reuse for Convolutional Neural Networks

Published: 16 December 2022

Abstract

Convolutional neural networks (CNNs) are emerging as powerful tools for image processing in important commercial applications. We focus on the important problem of improving the latency of image recognition. While CNNs are highly amenable to prefetching and multithreading to avoid memory latency issues, CNNs’ large data – each layer’s input, filters, and output – poses a memory bandwidth problem. While previous work captures only some of the enormous data reuse, full reuse implies that the initial input image and filters are read once from off-chip and the final output is written once off-chip without spilling the intermediate layers’ data to off-chip. We propose Occam to capture full reuse via four contributions. First, we identify the necessary conditions for full reuse. Second, we identify the dependence closure as the sufficient condition to capture full reuse using the least on-chip memory. Third, because the dependence closure is often too large to fit in on-chip memory, we propose a dynamic programming algorithm that optimally partitions a given CNN to guarantee the least off-chip traffic at the partition boundaries for a given on-chip capacity. While tiling is well-known, our contribution determines the optimal cross-layer tiles. Occam’s partitions reside on different chips, forming a pipeline so that a partition’s filters and dependence closure remain on-chip as different images pass through (i.e., each partition incurs off-chip traffic only for its inputs and outputs). Finally, because the optimal partitions may result in an unbalanced pipeline, we propose staggered asynchronous pipelines (STAPs) that replicate bottleneck stages to improve throughput by staggering mini-batches across replicas. Importantly, STAPs achieve balanced pipelines without changing Occam’s optimal partitioning. Our simulations show that, on average, Occam cuts off-chip transfers by 21× and achieves 2.04× and 1.21× better performance than the base case and Layer Fusion, respectively, and 33% better energy than the base case. Using a field-programmable gate array (FPGA) implementation, Occam performs 6.1× and 1.5× better, on average, than the base case and Layer Fusion, respectively.

1 Introduction

Advances in convolutional neural networks (CNNs) [23, 30, 33, 46] have resulted in highly accurate image classification and recognition. Current commercial applications are largely speech-based (e.g., voice assistants) and employ other types of neural networks (e.g., Long Short-Term Memory networks and Transformers), so that CNNs currently contribute only 5% to 10% of machine learning workloads [28]. However, emerging image-based applications (e.g., self-driving cars) are poised to increase this contribution, as anticipated by several commercial Deep Learning accelerators (e.g., Movidius, Nervana, and NVIDIA).
A CNN employs numerous filters to identify features that are combined into higher-order features so that each CNN layer applies the current layer’s filters to the previous layer’s output features and outputs the next set of features and, eventually, the final classification output. Each layer applies a set of filters to its input feature map by convolving each filter with the input map in two dimensions – that is, “sliding” the filter along the input’s two dimensions – to extract the corresponding feature. Each layer simply puts together its filter results as its output map. The weights in the filters are trained by reducing the error for the training inputs via back propagation. Like most previous CNN work, this article focuses on the recognition phase. Further, this article targets reducing the latency of recognition as opposed to improving the throughput. The latency goal is important for many interactive scenarios (e.g., self-driving cars) for which trading off latency for throughput is unacceptable [28]. This work also targets reducing transfers, which saves memory energy during offline inference in data centers.
The large number of layers (e.g., 34 in ResNet), and filters per layers (e.g., 256, 512), result in heavy compute and large intermediate data. The heavy compute is somewhat offset by the abundance of regular parallelism amenable to general-purpose graphics processing units (GPGPUs) [30], field-programmable gate arrays (FPGAs) [19, 47], and tensor processing units (TPUs) [28]. Recent work prunes the compute by using 8-bit fixed-point representation [36] or by exploiting zeroes in the data [2, 3]. The TPU optimizes the compute via a novel systolic multiply-add array. However, large intermediate data continues to be a challenge for performance, especially with more training data, as achieving higher accuracy needs deeper networks for better generalization. Because we target data reuse, we describe CNNs in terms of access patterns instead of neurons and synapses. Because CNNs are amenable to prefetching and multithreading, the problem is memory bandwidth and not memory latency [28].
In convolution, as filters “slide over” the input map, each input cell participates in many computations, which are repeated for each filter. Each weight cell in a filter is reused by each input map cell. Further, each layer’s output map is the next layer’s input map. All of this reuse is for one input image and occurs only in the convolution layers, whereas the fully connected (FC) layers do not have any reuse. The former account for more than 85% of execution time in earlier CNNs [30], and the latter are used only as the last layer in recent CNNs (e.g., GoogLeNet); hence, our study focuses on the convolutional layers.
Current practice is to write off-chip each layer’s output map, which is read back in by the next layer, losing interlayer reuse. DianNao [9] and its successors [10, 16, 37] propose to reduce off-chip traffic by placing all of the intermediate data and filters in an on-chip eDRAM, which may require impractically large capacity (e.g., 50 and 21 MB for VGGnet and ResNet, respectively). Other works propose to compress the data for the FC [20] and convolutional layers [2, 44]. While such compression reduces both the compute and memory volumes, the all-or-nothing approach works well only if the compressed data fit in on-chip memory and, otherwise, generates off-chip traffic (e.g., recent CNNs may require several MBs even after compression). Such compression also destroys compute and data access regularity, hurting the efficiency of GPGPUs, TPUs, and FPGAs (the sparse convolutional neural network accelerator [SCNN] employs crossbars to handle the irregularity).
Exploiting reuse with reasonable on-chip memory to reduce the off-chip traffic is challenging. Capturing full reuse implies that the initial input image and filters are read once off-chip and the final output is written once off-chip without spilling the intermediate layers’ data to off-chip. Because the filters have high reuse, they are held on-chip (e.g., Eyeriss [11]), similar to our on-chip residence though the architectures hold only one layer’s filters at a time and reload each layer’s filters once per input image. Further, capturing interlayer reuse requires holding a layer’s output map on-chip to be read by the next layer. An output map cell depends on many input map cells – the output cell’s dependence parents, due to a dot product sub-computation in each convolution. Many output map cells share the same dependence parents, which provides reuse. These output maps’ dependence ancestors transitively extend to include the corresponding input maps of the earlier layers. Accordingly, Layer Fusion [4], a pioneering work, holds only the ancestors of an output map tile to capture some of the interlayer reuse. While conventional tiling holds only the parents, this cross-layer tiling holds some or all of the ancestors similar to other work [40, 45] (see Section 6).
Despite these significant advances, none of the papers in the literature captures full reuse. To that end, we propose Occam and make the following contributions:
First, the necessary condition to exploit full reuse with the optimum (smallest) amount of on-chip memory is to hold one full input map row or column, whichever is shorter. Due to convolution, each input map cell is a dependence parent of many output map cells along both dimensions. Capturing such two-dimensional reuse requires holding the full shorter dimension. Thus, the necessary condition determines the optimal tile shape. Unlike matrix multiply, a canonical tiling candidate, the tile shape matters for CNNs (see Section 3.1). Layer Fusion’s tile, derived from another work [57], does not satisfy this condition, incurring expensive recomputation triggered by reuse not captured on-chip.
Second, Occam achieves full reuse by satisfying the sufficient condition of holding only the ancestors of a full output row or column all the way through all the layers to the initial input image – that is, the transitive closure of the dependence ancestors, called the dependence closure (receptive field in machine learning). Layer Fusion identifies the closure but not for the full output row (the necessary condition).
Third, because a CNN’s full dependence closure is often too large to fit in one chip’s memory, we partition the CNN into sets of contiguous layers so that each partition’s dependence closure fits on-chip (each partition reads its input map from and writes its output map to off-chip). Layer Fusion chooses suboptimal partitions because its exhaustive search is infeasible for large networks (e.g., \(2^{34}\) choices in ResNet). Instead, we propose a dynamic programming (DP) algorithm that optimally partitions a given CNN to guarantee the least off-chip traffic at the partition boundaries for a given on-chip memory capacity. While tiling is well known, our contribution is determining the optimal cross-layer tiles. Fortunately, Occam’s optimal partitions and tiles preserve CNNs’ regular parallelism, unlike prior compression work.
Finally, the input-stationary approach [11] holds on-chip input/output maps and fetches filters from off-chip, ignoring filter reuse across images (e.g., TPU). Occam’s partitions reside on different chips (e.g., GPU, TPU, or FPGA) available in a multi-accelerator environment such as data centers, forming a pipeline so that a partition’s filters and dependence closure remain on-chip as different images pass through (i.e., each partition incurs off-chip traffic only for its inputs and outputs). Occam amortizes filter loading to asymptotically zero cost over numerous images to achieve full cross-image reuse. While BrainWave [17] holds the filters on-chip, its partitions are ad hoc, do not employ tiling, and may incur pipeline imbalance. Because Occam’s optimal partitions may also result in an unbalanced pipeline, we propose staggered asynchronous pipelines (STAPs) that replicate bottleneck stages to improve throughput by staggering mini-batches across replicas. Importantly, STAPs achieve balanced pipelines without changing Occam’s optimal partitioning.
DP is a widely used meta algorithmic method for diverse problems. While our optimal DP formulation targets on-chip memory, another heuristic formulation [47] targets FPGA compute resources for CNNs. Similarly, PipeDream [21, 22] employs DP to minimize pipeline imbalance in training for a given pipeline depth without any data reuse considerations, whereas Occam optimizes data reuse in inference for a given cache capacity without constraining pipeline depth. Further, because large data is problematic for any CNN hardware architecture, such as GPGPU-, TPU-, or FPGA-based types, Occam is applicable to all architectures, as it targets memory reuse and transfers.
Our simulations show that, on average, Occam cuts off-chip transfers by 21 \(\times\) and achieves 2.04 \(\times\) and 1.21 \(\times\) better performance than the base case and Layer Fusion, respectively, and 33% better energy than the base case. Using an FPGA implementation, Occam achieves 6.1 \(\times\) and 1.5 \(\times\) better average performance than the base case and Layer Fusion, respectively.

2 Reuse in CNN

As mentioned in Section 1, a CNN comprises many layers, each of which employs many filters to extract higher-order features. A layer’s output map is the next layer’s input map. In general, each layer’s input map is a cuboid whose height h, width w, and depth n are analogous to the input image’s height and width and channel count, where n is the number of filters in the previous layer (Layer 1 in Figure 1) except for the inputs in which it refers to the 3 colors (R, G, and B).
Fig. 1. Convolutional neural network.
Each filter is another cuboid whose height and width are typically equal, k, and the channel count is the same as the input map’s, n, in the case of Layer 2 filters in Figure 1 (typically, \(k \lt \lt h, w\) ). Each layer’s output map dimensions are \(h\times w \times m\) , where m is the number of filters in the layer. Some layers, called pooling layers, shrink h and w without changing n by summarizing a set of cells (e.g., maximum of four neighboring cells). A cell is a \(1 \times 1 \times 1\) slice of a cuboid – a scalar.
Each output map cell is computed by a dot product of a filter and a subset of cells in the input maps of the same “cuboidal” dimensions as the filter. (See the faint arrows mapping a \(k \times k \times 3\) cuboid from the inputs to filter \(F1\) of same dimensions in Layer 1 in Figure 1). The next output map cell results from the dot product with the filter “slid over” on the input maps by a stride along the height and width dimensions, one at a time, in a two-dimensional convolution. Figure 1 shows the n filters of Layer 1, each of dimensions \(k \times k\times 3\) in various shades of tan/brown. Further, Figure 1 illustrates that each filter contributes to one channel of the next layer’s feature map (shown with the same color shade).
As a result of convolutional sliding, each cell in a layer’s input map, except for those at the boundaries, is reused \(k *k\) times by a filter in its various slid-over positions (assuming that the convolution stride is 1). This reuse is called convolutional reuse. With n filters in a layer, each input map cell contributes to n output cells in the output channel dimension (see Figure 1). Thus, each input map cell is reused by an additional factor of n called cross-filter reuse, for a total reuse of \(k * k * n\) times (e.g., for a layer with 128 3 \(\times\) 3 filters, there is 1,152 times reuse). Further, each layer’s output map is the next layer’s input map, providing another instance of reuse, called interlayer reuse, for a total reuse of \(k*k*n+2\) times.
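To make this reuse arithmetic concrete, the short sketch below counts the within-layer reuse factors for one interior input-map cell (an illustrative calculation, not code from the paper; it assumes unit stride, and the example layer is the 128-filter, 3 \(\times\) 3 case mentioned above).

```python
def reuse_factors(k: int, n_filters: int) -> dict:
    """Reuse of a single interior input-map cell in one convolutional layer
    (unit stride assumed; boundary cells see slightly less reuse)."""
    convolutional = k * k          # slid-over filter positions covering the cell
    cross_filter = n_filters       # every filter of the layer revisits the cell
    return {
        "convolutional": convolutional,
        "cross_filter": cross_filter,
        "total_in_layer": convolutional * cross_filter,
    }

if __name__ == "__main__":
    # Example from the text: 128 filters of 3 x 3 -> 3 * 3 * 128 = 1,152 reuses.
    print(reuse_factors(k=3, n_filters=128))
```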
Past implementations use im2col to transform the convolution into a matrix multiply. To that end, the implementations replicate the input map cells for every slide position of the filters so that each filter can be matrix-multiplied with an input-map tile of the same size. However, the replication causes enormous memory bloat (k-by-k filters cause \(k^2\) redundancy). Even NVIDIA’s GEMM implementations [1] perform convolutions using implicit GEMM, avoiding huge memory bloat. Therefore, our and other recent implementations do not take this approach. However, there may be some replication in the private L1 caches due to parallel execution.

2.1 Reuse in Filters

A layer’s filters are used only in that layer. Because each weight cell in a filter is “slid over” every input map cell in the height and width dimensions, each weight cell is reused \(h * w\) times, ignoring boundaries (typically, h, w \(\gt \gt\) k). To capture this reuse, current practice applies one filter at a time to the entire input map, refetching the input map if too large to be held on-chip, for every filter. In addition to capturing filter reuse, this approach also captures the \(k*k\) times reuse of the input map. However, the approach does not capture all of the reuse in the input map, requiring \((n + 2)*l\) refetches in the case of large input maps for l layers (l is large for deep networks, e.g., 34 in ResNet). Here, we make the simplifying assumption that all layers’ input maps and output maps are the same size. In Section 5, we show results for real CNNs in which this assumption is not true.
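As a rough illustration of this refetch count, the following back-of-the-envelope sketch tallies whole-map transfers under the filter-at-a-time strategy; the breakdown of the \((n+2)\) factor into n input re-reads plus one output write and one re-read by the next layer is our reading of the text, and the sizes in the example are hypothetical.

```python
def filter_stationary_map_transfers(n_filters: int, n_layers: int) -> int:
    """Whole-map transfers when a too-large input map is refetched for each
    filter: n re-reads per layer, plus (in our reading of the (n + 2) factor)
    one output write and one re-read by the next layer."""
    per_layer = n_filters + 2
    return per_layer * n_layers            # the (n + 2) * l figure from the text

if __name__ == "__main__":
    # Hypothetical: 128 filters per layer, 34 layers (ResNet-34-like depth).
    print(filter_stationary_map_transfers(n_filters=128, n_layers=34))
```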

2.2 Reuse in Input Map

One way to avoid refetching the input map is by applying all filters to each part of the input map before processing the next part. While this strategy spreads apart each filter’s reuse (across input map parts), many recent CNN architectures instead hold the filters on-chip because the filters are often larger than the input maps in later layers. However, due to their layer-by-layer processing, these architectures hold only one layer’s filters at a time. Thus, each layer’s filters have to be refetched for the next image (i.e., no cross-image reuse as captured by Occam). For example, as noted in Section 1, Eyeriss [11] applies all filters to each input map cell before the next cell, capturing both convolutional reuse and cross-filter reuse a total of \(k*k*n\) times but not interlayer reuse, resulting in \(2*l\) input map refetches for l layers. Residual CNNs, in which a layer may get an additional input map from a previous layer [23], require slightly more refetches: \(2*l+r\) for l layers of which r have one residual input. Recall from Section 1 that full cross-image reuse implies that the initial input (final output) is read (written) once from (to) off-chip without spilling the intermediate layers’ data. Such interlayer reuse was originally proposed in Layer Fusion [4] and later exploited in [35]. To effectively capture the interlayer reuse – and, thus, full reuse – we propose Occam.
Apart from convolution, CNN layers perform a few other simple local computations: (1) In batch normalization, the training phase computes a per-layer mean and variance used to adjust the layer’s output map in a recognition run. (2) Pooling summarizes a few output map cells of a layer into one (e.g., maximum of four neighboring cells). (3) In bias addition, a bias, either one per layer or one per output map cell, is added to a layer’s output map. (4) A rectified linear unit (ReLU) converts negative values into zeros. All of these are local operations done as the output map is produced. As such, they do not significantly change CNNs’ reuse behavior.

3 Occam

Recall from Section 1 that Occam makes four contributions. First, we specify the necessary condition for full reuse in terms of tile shape. Second, we propose an approach called dependence closure, which satisfies the sufficient condition to capture full reuse using the least on-chip memory. Third, because dependence closure for full reuse is often too large, we propose a dynamic programming algorithm that optimally partitions a given CNN while guaranteeing the least off-chip traffic at the partition boundaries for a given on-chip memory capacity. By holding the filters on-chip, our partitions achieve full cross-image reuse. Finally, we propose staggered asynchronous pipelines (STAPs) to balance the pipeline formed by the partitions.

3.1 Necessary Condition

Due to convolutional reuse, each input map cell of a layer is a dependence parent of many output map cells along both dimensions. For full reuse with rectangular filters,1 we need to hold on-chip at least one full input map row-plane or column-plane, whichever is smaller. A row-plane corresponds to a cuboid of dimensions 1 full row \(\times\) 1 \(\times\) n, where n is the number of channels. This condition identifies the tile shape necessary for full reuse.
To prove this condition, we make the following four simplifying assumptions (removed later): (1) a single input channel, (2) the input tile includes the element at \((0,0)\) , (3) a rectangular tile of dimensions \(X_t \times Y_t\) from the input feature map of dimensions \(X_{max} \times Y_{max}\) , and (4) partial output sums are not stored in the output tiles. To minimize capacity misses, the largest tile that fits in the cache, called a maximal tile, is used.
Proof sketch: To prove by contradiction, we assume that the input tile of dimensions \(X_t \times Y_t\) does not span a full row (or column) (i.e., \(X_t \lt X_{max}\) and \(Y_t \lt Y_{max}\) ) but is able to achieve full-reuse. Figure 2(a) shows the initial tile position. Because of the two-dimensional sliding nature of convolutional reuse, some elements at the boundaries of the input tile are dependence parents for output cells that cannot be computed fully with the current input tile data (black region indicates elements with future reuse). Moreover, such future reuse exists along both the X and Y dimensions.
Fig. 2. Necessary condition and sufficient condition.
We now show that any tile movement – even a single step in either the X or Y dimensions – rules out full reuse. Without loss of generality, a tile movement in the X dimension by some positive stride d (integer \(d\gt 0\) ) means that the corners of the tile are now at \((d,0), (d,Y_t), (X_t+d,0),\) and \((X_t+d, Y_t)\) . Because of maximal tiles, all elements in the original tile position, but not in the new position, are evicted. The evicted region’s corners are \((0,0), (0,Y_t), (X_t+d-1,0),\) and \((X_t+d-1,Y_t)\) . At least one evicted element is guaranteed to have future reuse; for example, input \((0,Y_t)\) (element in the red region of Figure 2(a)) is needed to compute the output cell \((0,Y_{t+1})\) in the future. Any tile movement would similarly evict a region of the partial output needed later. Such evictions contradict our initial assumption that the tile can achieve full reuse, thereby proving our condition.
Removing our assumptions: (1) Because the convolutional sliding occurs only in the X and Y dimensions and not in the Z dimension of channels, the argument presented earlier holds for multiple channels (the tile is a cuboid). (2) Including \((0,0)\) in the initial tile position ensures full reuse of at least \((0,0) \ldots (X_t+d-1,0)\) along the X-axis tile boundary. Without this assumption, there are even more elements with future reuse at the top and left of the tile. (3) To generalize the tile shape, we observe that because the filters are rectangles or squares in the \(X-Y\) dimensions (e.g., 3 \(\times\) 3), the output element \((0,0)\) requires the input tile to include the three corner elements – \((0,0)\) , \((0, Y_t),\) and \((0,X_t),\) with \(X_{max} \gt X_t \ge 0\) and \(Y_{max} \gt Y_t \ge 0\) – regardless of the tile shape. Now, this proof (for rectangular tiles) holds by ignoring the fourth corner \((X_t, Y_t)\) ; the same element \((0, Y_t)\) that has future reuse remains evicted, leading to suboptimal memory traffic. (4) By saving partially computed output cells, we can eliminate the evicted input element’s future reuse. However, in such a partial-output compute strategy, the output tile must include all appropriate partial output cells along the corners \((X_t,0)(X_t,Y_t)\) and \((0,Y_t),(X_t,Y_t)\) for full reuse. By using a similar proof for inputs, one can show that the output tile must hold a row span of output partial sums to achieve full reuse, shifting the problem from the input to the output. Considering the fact that a layer’s output is also an input to the next layer, we can generalize the necessary condition to only inputs.
Layer Fusion [4, 25] uses tiles from another work [57], which are not the full row-plane (or column-plane) and, hence, do not satisfy our necessary condition. That work casts determining the tile dimensions as an integer linear programming problem, whereas we have shown that the full row-plane is necessary for full reuse. Unlike convolution, matrix multiply, a canonical candidate for tiling, does not have two-dimensional reuse. Assuming that one matrix is held on-chip, any tile shape for the other works. In realistic scenarios, neither matrix can fit on-chip, requiring tiling of both matrices, unlike our problem.
Extending our necessary condition to the case of filters that are not fully filled rectangles (e.g., dilated filters [54] that are checkered with holes) remains a work in progress. For example, with a combination of dilated filters and strides, it is possible that there are holes in the input that are never used. Naïve use of rectangular tiles may result in fetching the data corresponding to the “holes” unnecessarily and is thus non-optimal. However, in such cases, it may be possible to preprocess the input to eliminate the holes (which are unused parts of the input), in which case the necessary condition may hold for the preprocessed input without holes. Finally, we do not consider entirely non-rectangular filters, which are not used in practice.
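As a deliberately simplified illustration of the tile-shape rule at the start of this section, the sketch below picks the cheaper of the two orientations that satisfy the necessary condition; the layer dimensions in the example are hypothetical, and INT8 (one byte per element) is assumed.

```python
def min_full_reuse_tile_elems(h: int, w: int, channels: int):
    """Smallest tile orientation satisfying the necessary condition: one full
    row-plane (1 x w x channels) or column-plane (h x 1 x channels),
    whichever is shorter."""
    row_plane = w * channels
    col_plane = h * channels
    if row_plane <= col_plane:
        return "row-plane", row_plane
    return "column-plane", col_plane

if __name__ == "__main__":
    # Hypothetical layer: 56 x 56 maps with 256 channels.
    orientation, elems = min_full_reuse_tile_elems(h=56, w=56, channels=256)
    print(orientation, elems, "elements (bytes at INT8)")
```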

3.2 Sufficient Condition

Holding only one input map row-plane (or column-plane) is not sufficient for full reuse. Observe that a layer’s output map cell is computed using many input map cells – the output map cell’s dependence parents – and that many output map cells share the same dependence parents. Fully capturing the reuse of these common parents in a single layer requires holding all of them on-chip until all dependent output map cells of the layer are computed. Combined with the above necessary condition, this single-layer sufficient condition amounts to holding all input map row-planes that are the dependence parents of an output map row-plane (i.e., k input map row-planes are held if the filter dimensions are k \(\times\) k \(\times\) m). As mentioned in Section 2.2, a layer in a residual CNN gets an additional input map from a previous layer. In such CNNs, an output map row-plane’s dependence parents also include that layer’s relevant output map row-planes.
Assuming that each layer reads the input map from and writes the output map to off-chip, the on-chip memory needs to hold all of the filters and the first k input map row-planes to compute the first output map row-plane (all of the n output map channels). To produce the next output map row-plane, the next set of new input map row-planes, as defined by the convolution stride, replace those input map row-planes that are no longer needed (as shown by updated tile in Figure 2(b)). This strategy, employed in Eyeriss [11], captures \(k* k* n\) times input reuse requiring two times the input map refetches between layers.
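The following sketch (not the authors’ code; the sizes in the example are hypothetical) tallies the on-chip footprint implied by this single-layer strategy: the layer’s filters, k input row-planes, and the output row-plane being produced, with `stride` new row-planes admitted per step.

```python
def single_layer_on_chip_elems(w: int, in_ch: int, out_ch: int,
                               k: int, stride: int = 1) -> dict:
    """Elements held on-chip when producing output row-planes one at a time:
    the layer's filters plus k input row-planes plus the output row-plane;
    each step then admits `stride` new input row-planes and evicts the oldest."""
    filters = k * k * in_ch * out_ch
    input_rows = k * (w * in_ch)           # k input row-planes of 1 x w x in_ch each
    output_row = w * out_ch                # the output row-plane being produced
    new_rows_per_step = stride * (w * in_ch)
    return {"filters": filters, "input_rows": input_rows,
            "output_row": output_row, "new_rows_per_step": new_rows_per_step}

if __name__ == "__main__":
    # Hypothetical layer: 56-wide maps, 128 -> 128 channels, 3 x 3 filters.
    print(single_layer_on_chip_elems(w=56, in_ch=128, out_ch=128, k=3))
```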
One way to avoid this assumption of off-chip traffic between consecutive layers is to hold on-chip only one layer’s input map and output map at a time. Because a layer’s output map is needed only by the next layer, this strategy guarantees full reuse. However, there is often no room on-chip to hold a layer’s filters, input maps, and output maps, losing the massive cross-image reuse of filters captured by Occam.

3.3 Dependence Closure

To avoid all traffic between layers (i.e., full reuse), the single-layer sufficient condition has to be extended to all of the layers’ output maps. The output maps’ dependence ancestors transitively extend to include the corresponding input maps of the earlier layers, as observed by Layer Fusion [4, 25] (the smaller rectangular black-and-gray boxes within each layer in Figure 3). The set of dependence ancestors at each layer depends on the number of new output map rows computed (black boxes in Figure 3) as well as the layer’s filter dimension and convolution stride. Extending this set across multiple layers forms a dependence sequence of input row-maps. For example, consider Figure 3, in which if all layers have a filter dimension of 3 and stride of 1, one row-plane of output map L3 depends on three row-planes of input map L2. Of the three input row-planes in L2, two are reused from the previous tile (gray boxes), whereas the remaining row-plane is computed (black box), which, in turn, depends on three row-planes of the previous layer’s input map L1, and so on. The full set of the ancestors of a final output map row-plane or column-plane through all of the layers to the initial input, called the dependence closure, satisfies the all-layer sufficient condition. In residual CNNs, the dependence closure does not change due to the residual input maps, which are also fed as non-residual input maps to a previous layer and, thus, are present already in the closure. As noted in Section 1, while conventional tiling typically holds only the dependence parents, this cross-layer tiling holds all of the dependence ancestors.
Fig. 3. Dependence closure.
Under dependence closure, execution proceeds by computing only the required number of output map row-planes in each layer as per the dependence sequence discussed earlier to produce the first row-plane of the final output. While the initial input is read from off-chip and the final output is written off-chip, the full dependence closure data (i.e., the relevant partial output map of all of the layers) is held on-chip. Going from the final output’s top row-plane to the next row-plane, the dependence closure also slides down at each layer by the number of new input-map rows needed (the black boxes in Figure 3). Because of reuse of input row-planes, the dependence closures of consecutive final output row-planes overlap (the gray boxes in Figure 3). Occam captures this overlap by holding on-chip the entire closure until needed and replacing only the unneeded, older parts of the closure with newer parts. Because Layer Fusion does not satisfy our necessary condition, it does not hold the entire closure. Instead, Layer Fusion proposes to recompute the missing parts of the closure or store the evicted values on-chip using extra storage.
The dependence closure discussed earlier assumes a final output tile with one full output row. Although it is possible to start with a final output tile of a single pixel (as described in [18]), we do not consider it because the parallelism of such a dependence closure is abysmal. In order to have sufficient parallel work, we consider only the tiles that span one output row.
To produce the final output’s second row-plane, the first layer discards (1) the top input map row-planes no longer needed, replacing them with the next set of input map row-planes as per the layer’s tile motion and (2) the output map row-planes in the temporary space that are no longer needed, overwriting them with the next set of output map row-planes. Thus, the temporary space acts conceptually as a circular buffer for the relevant input map and output map row-planes, similar to Layer Fusion. This circular buffer repeatedly reuses its space to hold the dependence closure of subsequent final output row-planes, conserving on-chip memory capacity. Later layers follow the same strategy (i.e., each layer has its own circular buffer of size defined by the dependence sequence). Because the computation crosses all layers to produce each final output row-plane, all of the layers’ filters need to be held on-chip along with the dependence closure.
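The row-plane bookkeeping described above can be summarized in a few lines. The sketch below illustrates the idea (it is not the authors’ implementation; the three-layer example mirrors Figure 3, and the widths and channel counts are hypothetical): it walks backward from one final output row-plane and reports how many input row-planes each layer’s circular buffer must hold, and the resulting closure size.

```python
from typing import List, Tuple

# Each layer is (input_width, input_channels, k, stride).
Layer = Tuple[int, int, int, int]

def closure_row_planes(layers: List[Layer], out_rows: int = 1) -> List[int]:
    """Walk backward from `out_rows` row-planes of the final output and record
    how many input row-planes each layer must buffer on-chip:
    rows_in = (rows_out - 1) * stride + k."""
    rows_needed = out_rows
    per_layer = []
    for (_, _, k, stride) in reversed(layers):
        rows_needed = (rows_needed - 1) * stride + k
        per_layer.append(rows_needed)
    return list(reversed(per_layer))   # per_layer[i] = row-planes of layer i's input map

def closure_elems(layers: List[Layer], out_rows: int = 1) -> int:
    """Total elements in the dependence closure: each layer's buffered
    row-planes times that layer's input row-plane size (width x channels)."""
    rows = closure_row_planes(layers, out_rows)
    return sum(r * w * c for r, (w, c, _, _) in zip(rows, layers))

if __name__ == "__main__":
    # Three 3x3, stride-1 layers as in Figure 3: one final output row-plane needs
    # 3 row-planes of its input, 5 of the layer before, and 7 of the initial input.
    layers = [(56, 64, 3, 1), (56, 64, 3, 1), (56, 64, 3, 1)]
    print(closure_row_planes(layers))   # [7, 5, 3]
    print(closure_elems(layers), "elements in the closure")
```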
The local operations for batch normalization, pooling, bias addition, and ReLU (see Section 2.2) occur as part of each layer’s computation without affecting the dependence closure.

3.4 Optimal Partition

Real CNNs’ full set of filters and dependence closures are too large to fit in one chip’s on-chip memory, or cache. Therefore, we partition a given CNN into sets of contiguous layers in which each partition’s dependence closure fits in the cache. Each partition reads its input once from and writes its output to off-chip memory. Layer Fusion [4] explores such partitioning via brute-force search, which is infeasible for modern CNNs that may have more than 100 layers. Instead, we propose a DP algorithm to guarantee the least off-chip traffic for a given CNN and cache capacity while ensuring that filters remain cache-resident to capture cross-image filter reuse. We use a running example (Figure 4) for illustration.
Fig. 4. Walkthrough example of optimal CNN partitioning.
Our formulation uses the following definitions:
\(L_i\) and \(W_i\) define the input feature map data and filters, respectively, for the \(i^{th}\) layer. \(|L_i|\) and \(|W_i|\) define the number of elements in \(L_i\) and \(W_i\) , respectively, independent of data format (e.g., FP32, FP16, INT8). \(L_0\) is the input image. Figure 4(a) shows the four feature maps ( \(L_0\) through \(L_3\) ) and the three layers’ weights ( \(W_0\) through \(W_2\) ) for our example.
We assume that the cache can hold C elements ( \(C=1,\!024\) in Figure 4(a)).
\(DC(i,j)\) defines the dependence closure of one row-plane of the output feature map in \(L_j\) extending back to the feature map of layer \(L_i\), where \(0\le i \lt j \le l\) and l is the number of layers (so \(L_0\) through \(L_l\) are the \(l+1\) feature maps). Thus, the end-to-end dependence closure defined earlier in Section 3.3 has \(|DC(0,l)|\) elements.
We define a \(SPAN(i,j)\) as the convolution computations starting with \(L_i\) as the input and ending with \(L_j\) as the output.
For on-chip resident filters, the minimum total footprint of all convolutional layers is the sum of all weights ( \(\Sigma _{i=0}^{l-1}|W_i|\) ) and the full dependence closure ( \(|DC(0,l)|\) ). If this footprint fits in the cache, then optimal operation needs no partitioning. Otherwise, the layers must be partitioned into spans such that (1) each \(SPAN(i,j)\) satisfies the capacity constraint that the dependence closure (i.e., \(|DC(i,j)|\) ) and weights (i.e., \(\Sigma _{k=i}^{j-1} |W_k|\) ) of the span must fit in the cache, and (2) the total amount of data transferred off-chip at the partition boundaries is minimized. Even in the degenerate case of a single-layer span, Occam offers modest, albeit reduced, bandwidth improvements. We later address the final case, in which even a single layer does not fit in the cache.
Definition 1.
For a CNN with l convolutional layers, we define a partition boundary set (PBS) as a set \(P = \lbrace p_1,p_2,\ldots ,p_{k-1}\rbrace\) that specifies a partitioning of the CNN into k spans — SPAN( \(0,p_1\) ), SPAN( \(p_1,p_2\) ), \(\ldots\) , SPAN( \(p_{k-1},l\) ). Here, we assume that the elements of P are strictly increasing integers (i.e., if \(i\lt j\) , then \(p_i \lt p_j\) ) that lie between 0 and l. For uniform naming of partition boundaries, we define \(p_0=0\) and \(p_k = l\) .
In a valid PBS, each span’s footprint fits in the cache:
\begin{align} \forall \, 0 \le i \lt k, \quad |DC(p_i, p_{i+1})| + \Sigma ^{p_{i+1}-1}_{m=p_{i}}|W_m| \lt C. \end{align}
(1)
Further, for any SPAN( \(p_i,p_{i+1}\) ) specified by such a PBS, the associated computation may run on a single chip. The only off-chip transfers are (1) reads of the input layer, which transfers \(|L_{p_i}|\) elements from off-chip memory (or upstream chips), and (2) writes of the output layer, which transfers \(|L_{p_{i+1}}|\) to off-chip memory (or downstream chips).
We present a DP algorithm that finds the optimal PBS to minimize the transfers for a given network. Informally, our problem fits DP because optimally partitioning a network into two would require that the “left” and “right” sub-partitions themselves be optimal. This claim is sound because of the optimal substructure property of the problem [13], which can be proved easily by contradiction as follows. If a supposedly optimal solution to \(SPAN(i,j)\) used a suboptimal solution (i.e., a partitioning that yields more transfers) for a recursive subproblem, then substituting the optimal solution for that subproblem would yield fewer overall transfers, contradicting the assumed optimality.
In our DP table \(OP\), indexed by the feature maps \(L_0\) through \(L_l\), each cell \(OP[i,j]\) holds three fields of information on the optimal partition of an arbitrary span from \(L_i\) to \(L_j\) : (1) p, the feature map that marks the optimal point of partition between \(L_i\) and \(L_j\) ; (2) X, the optimal number of transfers; and (3) F, the footprint of the largest sub-span within the span as per the optimal partition. Although the footprints F are included only for informational purposes, Figure 4(c) shows the DP table for our example, in which each cell with the three fields \(X,F,p\) represents a span.
DP base case: In the base case, if the entire footprint (filters + DC) of a \(SPAN(i,j)\) fits in a cache (Equation (1)), we initialize the number of transfers as the bare minimum — every element in the input and output layers i and j, respectively, is read and written exactly once (Equation (2)). Because the filter transfers are amortized to zero over multiple images, the filters are not counted in the transfers. Further, because no partitioning is necessary, p is set to null (Equation (3)). Figure 4(b) lists how \(X,F\) are computed for all spans that fit in the cache.
\begin{align} OP[i,j].X &= |L_i| + |L_j| \end{align}
(2)
\begin{align} OP[i,j].p &= null \end{align}
(3)
If single-layer spans ( \(SPAN(i, i+1)\) ) fit in the cache, we initialize \(OP[i,i+1]\) , \(0\le i\lt l\) , using the assignments presented earlier. Figure 4(c) shows with light shading all single-layer spans for the example configuration.
If even a single layer does not fit, then chip-residence for filters is not possible and weights must be included in the transfers (Equation (4)). In this special case, the transfers are a lower bound; depending on the implementation (i.e., its access patterns), either the input or filters may be transferred more than once and the layer remains a partition by itself.
\begin{align} OP[i,i+1].X &= |L_i| + |L_{i+1}| + |W_i| \end{align}
(4)
Table update for other cases: Our DP algorithm solves the optimal partition problem in a bottom-up fashion by increasing the span length beyond the base case of 1. A longer span either fits in the cache or not. The first case is treated exactly as the base case. OP[0,2] and OP[1,3] illustrate this case, as both spans fit in the cache (bottom two rows in Figure 4(b)). Figure 4(e) shows cell values for these spans in middle blue shade. For the second case, we consider every possible partition of \(SPAN(i,j)\) and pick the partition point p that yields the fewest transfers in the two resulting sub-spans, \(SPAN(i,p)\) and \(SPAN(p,j)\) .
\begin{align*} p_{opt} &= p \mid \forall (i \lt k \lt j) \\ &(OP[i,p].X + OP[p,j].X) \le (OP[i,k].X + OP[k,j].X) \end{align*}
The choice of the optimal partition point is shown in Figure 4(d). The algorithm compares the two choices ( \(L_1\) in yellow and \(L_2\) in brown) and chooses \(L_2\) , which results in fewer overall transfers.
We define the solution to longer spans \(SPAN(i,j)\) as
\begin{align} OP[i,j].X &= OP[i,p_{opt}].X + OP[p_{opt},j].X \end{align}
(5)
\begin{align} OP[i,j].p &= p_{opt.} \end{align}
(6)
The above table update step accurately tracks the number of off-chip transfers (i.e., X) of each of the two resulting spans (Equation (5)). Further, saving the partition points (p) facilitates reconstruction of the final optimal PBS (Equation (6)). Figure 4(b) shows the full computation of OP[0,3], which yields the optimal partitions: SPAN(0,2) and SPAN(2,3).
Finally, a divide-and-conquer algorithm, instead of DP, would not be efficient here because of the numerous overlapping sub-problems. For example, the subproblem \(OP[3,6]\) would be revisited when examining larger spans (e.g., \(OP[2,8]\) and \(OP[3,7]\) ). DP avoids such recomputation by memoizing the solutions in the OP table.
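For concreteness, the following is a compact sketch of the partitioner described above (a reconstruction from the equations in this section, not the authors’ tool; the dependence-closure callback `dc` and the example sizes at the bottom are hypothetical stand-ins). The `batch` argument anticipates the batched-inference extension discussed below.

```python
from typing import Callable, List, Tuple

def optimal_partition(L: List[int],                    # |L_0| .. |L_l| (feature-map sizes)
                      W: List[int],                    # |W_0| .. |W_{l-1}| (filter sizes)
                      dc: Callable[[int, int], int],   # |DC(i, j)| for a candidate span
                      C: int,                          # on-chip capacity in elements
                      batch: int = 1) -> Tuple[int, List[int]]:
    """Return (minimum off-chip transfers, partition boundary set).
    Base case (span fits): X = b * (|L_i| + |L_j|); filters stay chip-resident.
    Otherwise: split at the point p minimizing the two sub-spans' transfers."""
    l = len(W)
    INF = float("inf")
    X = [[INF] * (l + 1) for _ in range(l + 1)]   # optimal transfers per span
    P = [[None] * (l + 1) for _ in range(l + 1)]  # chosen split point per span

    def fits(i: int, j: int) -> bool:
        # Equation (1): the span's closure (scaled by the batch) plus its weights
        # must fit in the cache; weights do not scale with the batch.
        return batch * dc(i, j) + sum(W[i:j]) <= C

    for span in range(1, l + 1):
        for i in range(0, l + 1 - span):
            j = i + span
            if fits(i, j):
                X[i][j] = batch * (L[i] + L[j])            # Equations (2)/(7)
            elif span == 1:
                X[i][j] = batch * (L[i] + L[j]) + W[i]     # Equation (4): lower bound
            else:
                for p in range(i + 1, j):                   # table-update step
                    cost = X[i][p] + X[p][j]
                    if cost < X[i][j]:
                        X[i][j], P[i][j] = cost, p

    def boundaries(i: int, j: int) -> List[int]:
        p = P[i][j]
        return [] if p is None else boundaries(i, p) + [p] + boundaries(p, j)

    return X[0][l], boundaries(0, l)

if __name__ == "__main__":
    # Hypothetical 3-layer example in the spirit of Figure 4 (C = 1,024 elements).
    L = [300, 400, 200, 100]                     # feature maps |L_0| .. |L_3|
    W = [200, 300, 250]                          # filters |W_0| .. |W_2|
    dc = lambda i, j: sum(L[i:j + 1]) // 2       # crude stand-in for |DC(i, j)|
    print(optimal_partition(L, W, dc, C=1024))   # (800, [2]) -> SPAN(0,2) and SPAN(2,3)
```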
Extensions: This algorithm extends easily to handle (1) residual connections such as those used in ResNets [23] and (2) batched computation for inference on a minibatch of multiple images.
Simple Residual connections effectively read and aggregate values from upstream layers, which results in additional transfers. Residual connections interact with Occam in one of two ways: either the residual connection does not span any partition boundary or it does. In the first case, Occam guarantees that the residual reads impose no additional off-chip transfers because the residual values are already in the dependence closure. In the second case, the residual values result in additional transfers as the values must be written out to and read back from memory. These additional transfers require a minor change to Equation (5) as follows.
\begin{equation*} OP[i,j].X = OP[i,p_{opt}].X + OP[p_{opt},j].X + 2\times |L_{source}| \end{equation*}
If residual connections have convolutional layers on the connections, the second case results in two options: the convolution layer on the connection could be either before or after the partition boundary. In such cases, we enumerate both of the choices in the DP table as additional rows and columns. During the table update step to find \(p_{opt}\) , only valid spans are scanned to find the optimal split. A similar approach works for multiple branches with multiple layers to account for all topological orderings at the partition (e.g., bottleneck layers of MobileNetv2).
Batched inference is used commonly in CNNs (say, with a minibatch of b images). The only change is that feature map transfers and footprints (Equation (1)) increase proportionally to b, whereas filter transfers and footprints remain unchanged because the entire minibatch uses the same chip-resident filters. Accordingly, Equation (2) is modified to Equation (7):
\begin{align} OP[i,j].X &= b\times (|L_i| + |L_j|). \end{align}
(7)
Complexity: The DP algorithm is used to optimize partitions offline (like compiler optimizations). The algorithm is of asymptotic complexity \(O(l^3)\) , as there is potentially \(O(l)\) work to find the optimal partition point for up to \(O(l^2)\) spans in the OP table. In practice, the algorithm’s runtime is less than a second on a laptop, even for the largest network we consider (ResNet-152). In the case of branching, although the size of the OP table scales with the total number of orderings (T), increasing the overall complexity to \(O(l^3T^3)\), the runtime is still on the order of seconds because T is small in comparison with l. Given that an optimal solution can be computed efficiently, exploring heuristics for the problem is probably not warranted.

3.5 Staggered Asynchronous Pipelines (STAPs)

Occam captures the enormous inter-image filter reuse by placing each partition – its filters and dependence closure – in a separate chip (e.g., a GPU, TPU, or FPGA). As mentioned in Section 1, Occam is deployed in multi-accelerator environments such as data centers, where placing partitions on each chip is feasible. The partitions form a multi-chip asynchronous pipeline both to capture full inter-image reuse and to achieve high throughput (see Figure 5). In each stage, the earlier output map cells are written (or communicated) while the later cells are still being produced, so that data transfers are hidden under the producer stage’s heavy computation except for a small initial part (Section 5.2.1). While Occam’s pipeline achieves optimal transfers, other techniques [17, 31] employ ad hoc partitions.
Fig. 5. Pipeline for inter-image filter reuse.
Because Occam’s partitions may result in an unbalanced pipeline, throughput is limited by the bottleneck stage. However, latency is unaffected due to asynchronous pipelining if the job arrival rate is under the bottleneck rate (asynchronous stages do not wait for all stages to finish, unlike synchronous stages). For example, in an unbalanced, 4-stage Occam pipeline with 15-35-40-10 latency-units per stage, the latency is 100 units and the throughput is 1/40 \(units^{-1}\) .
Throughput can be increased by replicating the bottleneck stages. In our example, stages 2 and 3 are each replicated for a throughput of one inference per 20 units (Figure 5). These stages use the ith replica for the ith input mini-batch (i.e., parallelize across input mini-batches in a staggered manner and not within each mini-batch). Importantly, STAPs achieve better throughput without changing Occam’s optimal tiles.
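The throughput arithmetic above is easy to reproduce. The sketch below uses the 15-35-40-10 example from the text; the replication counts passed in are illustrative choices, not values prescribed by the paper.

```python
def pipeline_metrics(stage_latency, replicas):
    """Asynchronous pipeline: latency is the sum of stage latencies (stages do
    not wait on one another); throughput is set by the slowest effective stage,
    where replicating a stage r times serves r staggered mini-batches."""
    latency = sum(stage_latency)
    effective = [lat / r for lat, r in zip(stage_latency, replicas)]
    return {"latency": latency, "throughput": 1.0 / max(effective)}

if __name__ == "__main__":
    stages = [15, 35, 40, 10]
    print(pipeline_metrics(stages, [1, 1, 1, 1]))   # throughput 1/40 per unit time
    print(pipeline_metrics(stages, [1, 2, 2, 1]))   # 1/20 after replicating stages 2 and 3
```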
We consider other forms of parallelism, typically used in training. Model parallelism splits one or more [27, 29, 49] layers into multiple chips and must (1) copy the input to each chip and (2) merge the chips’ results for the next layer. These overheads are worthwhile to reduce filter update traffic in training but not in inference, where filters are not updated and data parallelism achieves the same effect without the overheads. Both Layer-wise parallelism [27] and HyPar [49] choose between model and data parallelisms for each layer, but model parallelism is not relevant for inference. Data parallelism is orthogonal to STAPs, which achieve higher throughput. For lower latency per image, each mini-batch can exploit data parallelism by replicating the entire pipeline without affecting Occam’s transfer optimality (e.g., Figure 5). Each replica would work on half the mini-batch. Thus, data parallelism does not change Occam’s off-chip transfer optimality. Finally, GPipe [24] exploits pipelined parallelism for training. GPipe uses heuristic-based partitioning (i.e., not optimal, unlike Occam) and targets minimizing the training pipeline loop from forward pass to back propagation – a problem that is not relevant to Occam.

4 Methodology

We use two implementations – software based and FPGA based – to evaluate Occam. Because Layer Fusion and Occam target memory reuse and transfers, these techniques are orthogonal to the compute architecture. We focus on widely deployed GPUs. We also show an FPGA-based implementation as a proof of concept that Occam works for hardware accelerators.
Software implementation: We implement Occam in CUDA, which is used prevalently to implement CNNs. Because Occam requires changes to the CUDA kernels of CNN implementations whereas commercial cuDNN frameworks are not open source, we choose ConvNet [30], an open-source framework. Like the latest CNNs, we use 8-bit integers (INT8), which both reduces the data volume and enhances the compute parallelism compared with 32- or 16-bit floating point data.
While the on-chip memory could be implemented as a software scratch-pad or hardware cache, the latter may incur conflict misses despite high associativity, whereas our partitioning considers only the cache capacity. We lay out each partition’s filters and dependence closure (see Section 3.3) sequentially to fit in the chip’s cache without capacity or conflict misses. We use channel contiguous data layout as it allows for an optimized implementation [1].
We implement our DP algorithm as a stand-alone JavaScript application that takes as input the network parameters and produces the optimal partitions and tile dimensions. We then feed these outputs to ConvNet.
Choice of Simulation: While our CUDA implementation can run on a real GPU, NVIDIA GPUs exhibit undocumented cache behavior that makes it hard to exploit reuse. For example, our tests revealed that both prefetch (up to 256 KB but not more) and early eviction (e.g., data read once but not written is evicted before the second touch within the same kernel and even when the footprint fits in the cache) make capturing reuse all but impossible (also observed in other studies [39]). More importantly, the hardware offers no mechanisms to disable these behaviors when they hurt performance. Therefore, we instead use a simulator (GPGPU-sim [5]) where the cache behaves as expected. Our FPGA implementation does not have these problems (see Section 5.3) as the on-chip memory is a software-controlled scratch-pad.
Simulated system: Our goal is to show that Occam improves performance over the best current system. Such a system uses “mini-batches” to perform batched inference on an accelerator such as Google’s TPU or the latest GPUs. We consider an NVIDIA-Volta–like accelerator with 140 Tops (140 1-GHz multiply-accumulate units), an 18-MB on-chip cache, and 800-GB/s memory bandwidth, processing 32-image mini-batches. These resources are shared across a mini-batch of 32 images, as shown in Figure 6. However, simulation of entire mini-batches would be impractically slow.
Fig. 6. Target hardware and software setup. Simulation covers a single inference from a mini-batch.
For practical simulation times, we carefully scale the simulated GPU to match a slice of the above system that performs a single inference out of the 32-image mini-batch. This scaling reduces the compute bandwidth by a factor of 32 because the work is proportional to the number of inferences. Further, because the simulated GPU incurs instruction overheads that an accelerator does not, the compute bandwidth is increased to 15 \(\times\) the scaled multiply-accumulate rate. For example, ResNet requires around 200 multiply-accumulate operations per memory byte, whereas a GPU requires around 3,000 instructions per byte (a 15 \(\times\) ratio).
For cache sizes and memory bandwidth, the scaling has to ensure that (1) the partition’s filters are chip resident and (2) the filters are shared, in cache capacity and memory bandwidth, across all 32 images in the mini-batch. Accordingly, we scale the cache to hold one image’s feature map data and the full filter data. Because this number varies for each partition, we use a calibration based on AlexNet, which yields a cache size of 3 MB. We validated our scaling methodology by simulating 1, 2, and 4 slices with appropriate scaling and verifying performance within 3% of one another. We use a similar calibration for memory bandwidth, reducing it by a factor of 6 (to around 133 GB/s).
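For reference, the scaling factors quoted above combine as follows; this is a back-of-the-envelope restatement of the text, and the resulting compute number is derived here rather than stated in the paper.

```python
def scale_to_single_inference(compute_tops=140, bandwidth_gbs=800,
                              minibatch=32, instr_overhead=15, bw_scale=6):
    """Apply the scaling factors quoted in the text: divide compute by the
    mini-batch size, then multiply by the instruction-overhead factor;
    divide memory bandwidth by ~6; the cache is calibrated separately to 3 MB."""
    return {
        "compute_tops": compute_tops / minibatch * instr_overhead,  # ~66 Tops (derived here)
        "cache_mb": 3,                                              # AlexNet-based calibration
        "bandwidth_gbs": round(bandwidth_gbs / bw_scale),           # ~133 GB/s
    }

if __name__ == "__main__":
    print(scale_to_single_inference())
```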
The simulated GPU’s parameters are shown in Table 1.
# Streaming Multiprocessors: 56
Pipeline Width: 128 (INT8)
Number of Registers / SM: 65,536
Shared Memory / SM: 98 KB
L2 Cache (shared): 3 MB, 128-B line, 12-banked, 16-way associative
Memory Bandwidth: 133 GB/s
Table 1. Scaled Hardware Parameters for a Single Image
While our on-chip residence requires multiple chips (STAPs; see Section 3.5), we simulate a single GPU to simplify our implementation. To emulate chip-resident filters in our one-GPU simulations, we pre-touch the filters for each partition (in Occam or Layer Fusion) before the partition executes.
Schemes: We implement three schemes – a layer-by-layer base case, Occam, and Layer Fusion. The filters of any single layer fit in the cache in the base case, which, like Eyeriss [11], captures all reuse except cross-layer reuse, even without tiling. We compare with Layer Fusion, which includes cross-layer reuse (due to partitioning and tiling). As such, Layer Fusion subsumes other techniques that have partitioning alone (e.g., BrainWave [17]). Occam’s chip-resident approach uses multiple chips for a given network. To ensure an equal-cost comparison, our base case uses the same number of chips as Occam for throughput via replication. Recall from Section 3.3 that Layer Fusion either employs re-computation for the evicted parts or uses extra space in the dependence closure to avoid evictions. We choose the latter in our Layer Fusion implementation, as it performs better.
Layer Fusion’s exhaustive search for partitions is infeasible for large networks (e.g., \(2^{34}\) choices in ResNet-34). As such, we use Occam’s optimal partition algorithm for Layer Fusion to compute the partitions with the largest square tile whose dependence closure for a given partition would fit in the cache (a different tile size for each partition). Because the tiles are suboptimal even though the partitions are optimal, Layer Fusion does not capture full reuse. We verified the functional correctness of these schemes by comparing against an unmodified ConvNet.
Benchmarks: Table 2 shows the networks used as benchmarks. We repeat each run thrice to capture any statistical variations (which turn out to be negligible). Table 2 also lists the partitions and the tile sizes for each scheme. All benchmark models were trained on the ImageNet dataset [15]. For ResNets, the shortcuts are zero-padded in our evaluation. We run some training runs to generate each network’s filters. Accuracy is not the main objective of Occam; it is studied elsewhere [48, 56].
Network (Layers): Occam partitions as \((P_{start}, P_{end}, TileDim)\) with tile \(TileDim \times\) Row Width | Layer Fusion partitions as \((P_{start}, P_{end}, TileDim)\) with tile \(TileDim \times TileDim\)
AlexNet (8): Occam (0,8,6) | Layer Fusion (0,8,3)
VGG-19 (19): Occam (0,10,2)(12,14,13)(16,18,7) | Layer Fusion (0,6,20)(6,11,5)(12,14,8)(16,18,7)
ZFNet (8): Occam (0,6,13)(6,8,6) | Layer Fusion (0,5,13)(5,8,6)
ResNet-18 (18): Occam (0,14,14) | Layer Fusion (0,12,4)(12,15,7)
ResNet-34 (34): Occam (0,18,3)(18,23,8)(23,28,3) | Layer Fusion (0,15,7)(15,21,5)(21,26,5)(26,29,7)
ResNet-50 (50): Occam (0,27,2)(27,42,1)(43,45,7)(46,48,7) | Layer Fusion (0,21,9)(21,30,5)(30,37,5)(37,42,7)(43,45,7)(46,48,7)
ResNet-101 (101): Occam (0,27,2)(27,34,6)(34,42,5)(42,49,6)(49,57,6)(57,64,6)(64,72,6)(72,79,5)(79,87,14)(87,93,7)(94,96,7)(97,99,7) | Layer Fusion (0,21,9)(21,30,5)(30,37,5)(37,45,5)(45,52,5)(52,60,5)(60,67,5)(67,75,5)(75,82,5)(82,90,5)(90,93,7)(94,96,7)(97,99,7)
ResNet-152 (152): Occam (0,36,6)(36,43,6)(43,51,5)(51,58,6)(58,66,5)(66,73,6)(73,81,5)(81,88,6)(88,96,5)(96,103,5)(103,111,6)(111,118,6)(118,126,6)(126,133,5)(133,141,5)(141,144,7)(145,147,7)(148,150,7) | Layer Fusion (0,21,9)(21,36,8)(36,43,5)(43,51,5)(51,58,5)(58,66,5)(66,73,5)(73,81,5)(81,88,5)(88,96,5)(96,103,5)(103,111,5)(111,118,5)(118,126,5)(126,133,5)(133,141,5)(141,144,7)(145,147,7)(148,150,7)
MobileNetv2 (53): Occam (0,48,1)(48,52,8) | Layer Fusion (0,21,4)(21,48,4)(48,52,8)
ShuffleNetv2 (56): Occam (0,25,1)(25,37,5)(37,45,2)(45,56,2) | Layer Fusion (0,19,4)(19,31,5)(31,42,5)(42,45,3)(45,56,2)
Table 2. Benchmark Characteristics
We use the input images provided with ConvNet for all of our runs. We simulate full network execution except for the fully connected layers.

5 Results

We present three sets of results: analytical, simulation, and FPGA. The analytical results show Occam’s optimal tile and partitions for the benchmark networks and the traffic savings as tracked by our algorithm (Section 3.4). The simulation results compare the execution times and energy for Occam and Layer Fusion. The FPGA results compare execution cycles for Occam and the base case.

5.1 Analytical Results

In Table 2, we present the optimal partitions and tile dimensions for our networks for 3-MB on-chip memory. While the tile dimensions for Occam and Layer Fusion are different, we use Occam’s partition algorithm to derive partitions for Layer Fusion (see Section 4). The partitions are shown using the start layer for each (e.g., AlexNet uses a single partition from layers 0 through 8). The tile sizes are shown as triplets of the form \((pBegin, pEnd,\) \(pTileDimension),\) where \(pBegin\) and \(pEnd\) are the layers at the beginning and the end (exclusive) of the partition. Layer Fusion’s tiles are square shaped ( \(TileDim\times TileDim\) ). In Occam, the tile dimension corresponds to the number of full rows ( \(TileDim\times RowWidth\) ). For a realistic capacity of 3 MB, Occam is able to achieve many multi-layer partitions that capture inter-layer reuse.
Recall that each partition’s filters reside on-chip for all of the schemes. Our algorithm calculates the capacity for each partition’s filters and dependence closure. Figure 7 shows this capacity split for ResNet152. For all of our networks, most of the on-chip capacity goes to the filters and a small fraction to the dependence closures. This result highlights the importance of our sufficient condition (our second contribution). The large capacity for the filters saves significant off-chip traffic due to massive cross-image filter reuse (our fourth contribution).
Fig. 7. Capacity split for ResNet152.

5.2 Simulation Results

5.2.1 Performance.

Figure 8 shows the speedup in execution time (latency) (Y-axis) for Occam and Layer Fusion over the baseline for various CNNs (X-axis). In addition to the individual networks, Figure 8 also shows the geometric mean speedup across all networks (rightmost bars). Occam uniformly outperforms the baseline with a mean speedup of 2.04 \(\times\) over all networks. The speedups are higher for the larger, more recent networks, such as the ResNets. These speedups are a direct result of the large reduction in the miss counts achieved by Occam (21 \(\times\) on average) while avoiding instruction bloat (measured miss counts normalized to the measured baseline’s in Table 3). This miss reduction is due to our necessary condition, which leads to our specific tile shape and our dynamic programming algorithm that minimizes off-chip traffic (our first and third contributions).
Fig. 8. Occam’s kernel execution speedup.
Network        Miss (Measured, Base = 1)        Miss (Model-Predicted, Base = 1)   Normalized Instructions
               Occam   Layer Fusion   Base      Occam   Layer Fusion               Occam   Layer Fusion
AlexNet        0.10    0.10           1.00      0.05    0.05                       1.05    1.07
VGG-19         0.10    0.15           1.00      0.06    0.07                       1.05    1.15
ZfNet          0.08    0.08           1.00      0.06    0.06                       1.05    1.07
ResNet18       0.03    0.06           1.00      0.04    0.04                       1.05    1.14
ResNet34       0.03    0.05           1.00      0.03    0.03                       1.07    1.17
ResNet50       0.02    0.05           1.00      0.03    0.04                       1.08    1.17
ResNet101      0.03    0.04           1.00      0.03    0.04                       1.08    1.17
ResNet152      0.03    0.04           1.00      0.03    0.04                       1.09    1.18
MobileNetv2    0.02    0.04           1.00      0.03    0.03                       1.07    1.15
ShuffleNetv2   0.03    0.04           1.00      0.01    0.02                       1.07    1.15
Mean           0.04    0.06           1.00      0.03    0.04                       1.07    1.14
Table 3. Normalized LLC Miss and Instruction Counts
Table 3 also shows that our model-predicted miss counts (normalized to the measured baseline’s) closely match those seen in simulation measurements. The only exception is VGGNet where there are a few individual layers that are too large to fit in the cache. Recall that Occam uses a lower-bound estimate for such cases.
Layer Fusion also achieves speedups, 1.7 \(\times\) on average, due to lower traffic. However, its speedups are lower than Occam's because its suboptimal tiles have larger dependence closures, which induce more partitions and, thereby, more transfers (a reduction of only 16 \(\times\) on average). This transfer increase appears in the normalized measured miss counts in Table 3. AlexNet is a special case in which, for each partition, the entire output and its dependence closure fit in the cache. ZfNet has the same number of partitions with the same last-layer dimensions under both schemes. In these two cases, Occam and Layer Fusion are effectively equivalent.
The normalized instruction counts in Table 3 show the instruction overhead arising from tile management in Occam and Layer Fusion. Occam's one-dimensional tile motion results in lower overhead, whereas Layer Fusion's two-dimensional tile motion incurs more.
Finally, Occam’s latency penalty for the initial inter-chip transfer has minimal impact on performance (see Section 3.5). The typical-case PCIe latency of 30 \(\mu\) s per partition (PCIe latency varies from 10 \(\mu\) s to 50 \(\mu\) s [55]) results in slightly reducing the average 2.04 \(\times\) speedup to approximately 2.0 \(\times\) . (Subsequent transfers are hidden under computation due to pipelining.) Transferring data to/from on-chip cache to off-chip memory is taken into account in the partition time. We discuss next the energy penalty of inter-chip communication.

5.2.2 Energy.

We did not use GPUWattch [34] because several of its quirks lead to evidently incorrect results. For example, as part of minimizing the mean-square error against its benchmarks, GPUWattch assigns large scaling factors without any physical rationale (e.g., memory write energy scaled down by \(10^{4}\) ). These factors lead to obviously incorrect results when simulating CNNs (e.g., memory system energy is approximately 0.25% of total energy). Instead, we use the TPU's compute energy of 0.43 pJ/op [28] and GDDR5 DRAM energy of 6 pJ/bit, or 48 pJ/B [41] (roughly 100 \(\times\) more expensive than compute [14]).
Figure 9 shows the energy breakdown into compute, memory, and chip-to-chip PCIe components for the baseline, Occam, and Layer Fusion. The PCIe component is not applicable to the baseline, which is not chip resident and runs one network entirely on one chip. For equal cost, the baseline uses the same number of chips as Occam for parallelism (see Section 4). For an equal-cost comparison, we assume chip residence even for Layer Fusion, which then incurs PCIe energy like Occam (Section 4). In the baseline, compute and memory energy are split roughly 65:35 on average. Occam's compute energy is slightly higher than the baseline's, as shown by the instruction counts (Table 3). Occam drastically cuts memory transfers (21 \(\times\) on average, in Table 3) but incurs the extra energy of chip-to-chip transfers at partition boundaries. The net effect of these factors is a 33% average reduction in energy because the energy cost per bit for DRAM and PCIe is similar (6 pJ/bit [41, 55]). Finally, Layer Fusion's memory energy saving due to fewer transfers (Table 3) is partially offset by the extra transfers induced by its suboptimal tiles, resulting in a net energy saving of 30% on average. Layer Fusion's savings are close to Occam's because it, too, benefits from our DP partitioning algorithm and from multi-chip pipelining.
Fig. 9. Occam's kernel energy savings.

5.2.3 Sensitivity to Cache Capacity.

To evaluate Occam’s sensitivity to cache capacity and to confirm that Occam’s techniques apply broadly, Figure 10 presents the speedups in latency (Y-axis) for a subset of all networks for increasing cache sizes (bars within each group). For AlexNet and ZfNet, Occam was able to use a single partition even with a 3 MB cache. As such, increasing the cache capacity cannot further improve performance, as verified by our results (left-most bars).
Fig. 10. Occam's sensitivity to cache capacity (miss count shown on top of the bar).
The partitions are completely different for different cache sizes, leading to different numbers of transfers. In Figure 10, the numbers on top of each bar indicate cache miss counts (in cache blocks). The miss counts decrease with larger caches, as expected, because Occam forms fewer, larger optimal partitions and thus incurs fewer misses. In general, a larger cache results in fewer partitions because more layers fit in a partition. The new partitions can also produce corner effects depending on the spare cache capacity left after Occam's partitioning. When there is abundant spare capacity, Occam reduces its tiling overheads by fitting larger tiles. However, low spare capacity (because more layers are packed into each partition) may result in smaller tiles and increased overhead. For ResNet34, we see a slight drop in speedup as the cache capacity increases from 3 MB to 4.5 MB; indeed, those configurations use smaller tiles and end up with slightly more tiling overhead. VGG-19, ResNet18, ResNet50, and ResNet101, on the other hand, improve their speedup with more cache capacity (while also exhibiting small corner effects). Again, this trend closely correlates with tile size.
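The tile-size side of this corner effect can be sketched as follows; the byte counts are hypothetical placeholders, and the sketch deliberately keeps the partition (and hence the filter footprint) fixed, whereas Occam's partitioning itself also changes with capacity.

def tile_rows(cache_bytes, filter_bytes, closure_bytes_per_row, halo_rows):
    # Largest full-row tile that fits alongside the partition's resident filters.
    spare = cache_bytes - filter_bytes
    return max(spare // closure_bytes_per_row - halo_rows, 1)

def tile_passes(output_rows, rows_per_tile):
    return -(-output_rows // rows_per_tile)   # ceiling division

# Hypothetical partition: 2.5 MB of filters, 40 KB of closure per tile row,
# 6 halo rows, and a 56-row output map.
for cache_mb in (3, 4.5, 6):
    rows = tile_rows(int(cache_mb * 2**20), 2_500_000, 40_000, 6)
    print(f"{cache_mb} MB cache: {rows} rows/tile, {tile_passes(56, rows)} passes")

Fewer, larger tiles mean fewer passes and less tiling overhead; packing more layers into a partition shrinks the spare capacity and pushes the configuration the other way.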

5.3 FPGA Results

We use a Terasic DE2-150 FPGA development board with an Intel/Altera Cyclone IV family FPGA. The FPGA has approximately 150K logic elements (LEs), 820 KB of on-chip RAM, and over 300 18-bit fixed-point multipliers running at 50 MHz. The FPGA is interfaced to an external SDRAM, which offers 350 MB/s of bandwidth. While the logic is adequate for our 64-lane cluster, the limited on-chip memory led us to use thinned neural networks obtained by halving the number of channels and filters. We use Intel's Quartus Prime (v15.0) with the Qsys system builder to integrate our SystemVerilog implementation (not SystemC) with a soft-processor core (Nios II).
We implement a single cluster of 64 lanes, each comprising a multiply-accumulate (MAC) unit built from the embedded 18-bit multipliers. Each lane holds a filter subvector (128 elements) fetched from the on-FPGA RAM for Occam (i.e., chip resident). Each input feature map subvector is broadcast to all lanes, and each lane computes the full input-map-filter vector-vector product to produce one output cell. Our implementation employs double buffering to hide on-chip and off-chip memory latency, and uses 99K (of the 150K available) LEs and all of the M9K RAM blocks. The Nios II issues commands for (1) each subvector-subvector multiply and (2) fetching the filter and input map subvectors. Like our software implementation, we warm up the on-FPGA RAM with each partition's filters to capture chip residency with a single FPGA (see Section 4).
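For clarity, the lane organization can be summarized by a small behavioral model; this Python sketch captures only the dataflow (the 64-lane and 128-element parameters come from the text, the data shapes are hypothetical, and the real design is SystemVerilog with double buffering, which the model omits).

import numpy as np

LANES, SUBVEC = 64, 128   # 64 lanes; 128-element filter subvectors

def cluster_step(x_sub, filt_subs, acc):
    # One broadcast step: every lane multiplies the same input subvector by its
    # private filter subvector and accumulates into its own output cell.
    assert x_sub.shape == (SUBVEC,) and filt_subs.shape == (LANES, SUBVEC)
    return acc + filt_subs @ x_sub

def full_product(x, filters):
    # Full input-map x filter vector-vector products, SUBVEC elements at a time.
    acc = np.zeros(LANES)
    for s in range(0, x.size, SUBVEC):
        acc = cluster_step(x[s:s + SUBVEC], filters[:, s:s + SUBVEC], acc)
    return acc   # one output cell per lane

# Hypothetical sizes: a 1152-element flattened input window and 64 filters.
rng = np.random.default_rng(0)
x, w = rng.standard_normal(1152), rng.standard_normal((64, 1152))
assert np.allclose(full_product(x, w), w @ x)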
Figure 11 shows Occam's and Layer Fusion's speedups in latency over the baseline using our FPGA implementation for all of our networks. Because the baseline FPGA implementation avoids instruction overheads and is more compute-efficient than a GPGPU, it puts more pressure on memory. Consequently, Occam's considerable memory traffic reduction, 21 \(\times\) on average (Table 4), results in average speedups of 6.1 \(\times\) over the base case and 1.5 \(\times\) over Layer Fusion, which are higher on the FPGA than on a GPGPU (see Figure 8).
Fig. 11. Occam performance on FPGA.
Table 4. Traffic Reduction on FPGA

Network           | AlexNet | Res34 | Res101 | Geo. Mean
Traffic reduction | 7×      | 31×   | 43×    | 21×

6 Related Work

There is a plethora of work on CNNs in machine learning (ML) research [56]. While exciting innovations in ML continue to improve CNN accuracy and efficiency, our focus is efficient execution of CNNs. Section 1 covers past work on CNN architectures, such as compute optimizations [2, 3, 19, 28, 30, 36, 47], memory optimizations [10, 16, 20, 37, 44], and reuse optimizations [4, 11]. We have extensively covered Layer Fusion, the work closest to Occam.
Occam’s key contribution is optimal tiling and network partitioning. Use of CNN pipelines is widespread [17, 31]; our contribution is finding the pipeline partitions that optimize off-chip transfers. Tiling [12, 26, 32, 52, 53] is a well-explored area of research. Occam’s optimal tile shape targets convolutional reuse. Diamond tiling [7] addresses tile choices in stencil computations that force pipelined start-up and induce load imbalance. Diamond-shaped tiles address this problem. Time tiling [6] enables tiling of stencil computations on periodic domains. While conventional tiling typically holds only the parents, Layer Fusion and Occam perform cross-layer tiling, which holds some or all of the ancestors. In tile-based fused layers [25], tiling is introduced on top of Layer Fusion but the proposed tiles do expensive re-computation with non-full row span tiles. Occam is evaluated against Layer Fusion with extra storage that performs better by avoiding the expensive re-compute. Work on minimizing off-chip accesses [50] optimizes unaligned bus transfers by choosing appropriate tile shape. However, in dense accelerators such as GPUs, in which entire row-planes are processed at once, we do not observe misaligned accesses for data stored in channel-wise order. Consequently, such tile optimizations are not applicable for Occam.
Acceleration with cross-layer reuse [51] proposes optimized GPU kernels but is limited to two consecutive layers. Because that optimization operates at the shared-memory level, it can be applied to the layers within an Occam partition to further improve performance. Proposals such as Timeloop [43] perform loop optimizations to reduce data movement without accounting for cross-layer reuse; while Occam reduces movement using cross-layer reuse, Timeloop can be used on top of Occam to reduce data movement from the last-level cache to the compute units. Fusion works such as predictive Layer Fusion [42] trade off accuracy for performance. Trading off accuracy pushes back decades of ML research; as such, we do not consider such approaches.
PolyMage [40] and Halide [45] explore cross-stage tiling for image-processing pipeline stages and use heuristics to partition the pipelines; in contrast, Occam employs DP to find the optimal tiles. While Optimus [8] also uses DP to create partitions and DLFusion [38] adopts a greedy partitioning algorithm for parallelism across compute units, both assume a setup in which filters are fetched per inference, whereas Occam's setup exploits inter-image filter reuse, with the weight-transfer cost amortized to zero over many images.
We covered pipelining work related to STAPs in Section 3.5.

7 Conclusion

This article targeted improving the latency of CNN inference. While CNNs can avoid memory latency problems via prefetching and multithreading, memory bandwidth is a problem due to their large data. While there is enormous data reuse, previous work captures only some of it. Full reuse implies that the initial input image and filters are read once from off-chip and the final output is written once to off-chip, without spilling the intermediate layers' data. We proposed Occam to capture full reuse via four contributions. First, we identified the necessary condition for full reuse. Second, we proposed the dependence closure as the sufficient condition to capture full reuse using the least on-chip memory. Third, because the on-chip memory needed for the full dependence closure is often too large, we proposed a dynamic programming algorithm that optimally partitions a given CNN to guarantee the least off-chip traffic at the partition boundaries for a given on-chip memory capacity. While tiling is well known, our contribution is determining the optimal cross-layer tiles for a given on-chip memory capacity. Finally, because Occam's partitioning may result in unbalanced pipelines, we proposed staggered asynchronous pipelines (STAPs) to improve throughput without perturbing Occam's off-chip-transfer optimality. Our simulations show that, on average, Occam cuts off-chip transfers by 21 \(\times\) and achieves 2.04 \(\times\) and 1.21 \(\times\) better performance than the base case and Layer Fusion, respectively, and 33% better energy than the base case. Using an FPGA implementation, Occam achieves 6.1 \(\times\) and 1.5 \(\times\) better average performance than the base case and Layer Fusion, respectively. Occam's simplicity and effectiveness make it an attractive option for CNN-based recognition.

Footnote

1. We discuss non-rectangular filters at the end of this subsection.

References

[1]
2022. NVIDIA Deep Learning Performance documentation. Retrieved October 11, 2022 from https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html. Updated May 17, 2022.
[2]
Jorge Albericio, Alberto Delmás, Patrick Judd, Sayeh Sharify, Gerard O’Leary, Roman Genov, and Andreas Moshovos. 2017. Bit-pragmatic deep neural network computing. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (Cambridge, Massachusetts) (MICRO-50’17). ACM, New York, NY, 382–394.
[3]
Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In Proceedings of the 43rd International Symposium on Computer Architecture (Seoul, Republic of Korea) (ISCA’16). IEEE, 1–13.
[4]
Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16), Taipei, Taiwan. IEEE, 1–12.
[5]
Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software, Vancouver, BC, Canada. IEEE, 163–174.
[6]
Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and optimizing time-iterated computations on periodic domains. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14), Edmonton, AB, Canada. ACM, 39–50.
[7]
U. Bondhugula, V. Bandishti, and I. Pananilath. 2017. Diamond tiling: Tiling techniques to maximize parallelism for stencil computations. IEEE Transactions on Parallel and Distributed Systems 28, 5 (May 2017), 1285–1298.
[8]
Xuyi Cai, Ying Wang, and Lei Zhang. 2021. Optimus: Towards optimal layer-fusion on deep learning processors. In Proceedings of the 22nd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (Virtual, Canada) (LCTES’21). ACM, New York, NY, 67–79.
[9]
Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’14). ACM, New York, NY, 269–284.
[10]
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47), Cambridge, UK. IEEE, 609–622.
[11]
Yu-Hsin Chen, Tushar Krishna, Joel Emer, and Vivienne Sze. 2016. 14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In 2016 IEEE International Solid-State Circuits Conference (ISSCC’16), San Francisco, CA, USA. 262–263.
[12]
Stephanie Coleman and Kathryn S. McKinley. 1995. Tile size selection using cache organization and data layout. In Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation (PLDI’95), La Jolla, California, USA. ACM, 279–290.
[13]
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (3rd ed.). The MIT Press.
[14]
William J. Dally, Ujval J. Kapasi, Brucek Khailany, Jung Ho Ahn, and Abhishek Das. 2004. Stream processors: Programmability and efficiency. Queue 2, 1 (March 2004), 52–62.
[15]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA. IEEE, 248–255.
[16]
Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting vision processing closer to the sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA’15), Portland, Oregon. ACM, 92–104.
[17]
Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt, Adrian M. Caulfield, Eric S. Chung, and Doug Burger. 2018. A configurable cloud-scale DNN processor for real-time AI. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA’18), Los Angeles, CA, USA. IEEE Press, 1–14.
[18]
Koen Goetschalckx and Marian Verhelst. 2019. Breaking high-resolution CNN bandwidth barriers with enhanced depth-first execution. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 2 (2019), 323–331.
[19]
V. Gokhale, A. Zaidy, A. X. M. Chang, and E. Culurciello. 2017. Snowflake: An efficient hardware accelerator for convolutional neural networks. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS’17), Baltimore, MD, USA. IEEE, 1–4.
[20]
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. In ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16), Seoul, Republic of Korea. 243–254.
[21]
Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, and Phillip B. Gibbons. 2018. PipeDream: Fast and efficient pipeline parallel DNN training. CoRR abs/1806.03377 (2018). arXiv:1806.03377. http://arxiv.org/abs/1806.03377.
[22]
Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Gregory R. Ganger, and Phillip B. Gibbons. 2018. PipeDream: Pipeline Parallelism for DNN Training. SysML 2018: Conference on Systems and Machine Learning, Extended Abstract and Poster, Huntsville, Ontario, Canada. Association for Computing Machinery.
[23]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR abs/1512.03385 (2015).
[24]
Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. 2018. GPipe: Efficient training of giant neural networks using pipeline parallelism. CoRR abs/1811.06965 (2018). arxiv:1811.06965.
[25]
Fabrizio Indirli, Ahmet Erdem, and Cristina Silvano. 2020. A tile-based fused-layer CNN accelerator for FPGAs. In 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS’20), Glasgow, UK. IEEE, 1–4.
[26]
F. Irigoin and R. Triolet. 1988. Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL’88), San Diego, California, USA. ACM, 319–329.
[27]
Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken. 2018. Exploring hidden dimensions in parallelizing convolutional neural networks. CoRR abs/1802.04924 (2018). arxiv:1802.04924.
[28]
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA’17), Toronto, ON, Canada. ACM, 1–12.
[29]
Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997 (2014). arxiv:1404.5997.
[30]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Lake Tahoe, Nevada. Curran Associates, Inc., 1097–1105.
[31]
H. T. Kung, Bradley McDanel, and Sai Qian Zhang. 2018. Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization. CoRR abs/1811.04770 (2018). arxiv:1811.04770.
[32]
Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), Santa Clara, California, USA. ACM, 63–74.
[33]
Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (Nov. 1998), 2278–2324.
[34]
Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. 2013. GPUWattch: Enabling energy optimizations in GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA’13), Tel-Aviv, Israel. ACM, 487–498.
[35]
Pin-Wei Liao, Wei-Chung Hsu, and Shih-Wei Liao. 2021. Intra- and inter-layer transformation to reduce memory traffic for CNN computation. In 50th International Conference on Parallel Processing Workshop, Virtual, Chicago, USA. 1–5.
[36]
Darryl D. Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy. 2016. Fixed point quantization of deep convolutional networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 (ICML'16), New York, NY, USA. JMLR.org, 2849–2858.
[37]
Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Teman, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. 2015. PuDianNao: A polyvalent machine learning accelerator. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15), Istanbul, Turkey. ACM, 369–381.
[38]
Zihan Liu, Jingwen Leng, Quan Chen, Chao Li, Wenli Zheng, Li Li, and Minyi Guo. 2020. DLFusion: An auto-tuning compiler for layer fusion on deep neural network accelerator. In 2020 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), Exeter, United Kingdom. IEEE, 118–127.
[39]
X. Mei and X. Chu. 2017. Dissecting GPU memory hierarchy through microbenchmarking. IEEE Transactions on Parallel and Distributed Systems 28, 1 (Jan. 2017), 72–86.
[40]
Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic optimization for image processing pipelines. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15), Istanbul, Turkey. ACM, 429–443.
[41]
Micron Technical Note. 2017. The next generation graphics DRAM. Micron, 20. Retrieved October 11, 2022 from https://www.micron.com/-/media/client/global/documents/products/technical-note/dram/tned03_gddr6.pdf.
[42]
MohammadHossein Olyaiy, Christopher Ng, and Mieszko Lis. 2021. Accelerating DNNs inference with predictive layer fusion. In Proceedings of the ACM International Conference on Supercomputing (ICS'21), Virtual Event, USA. ACM, New York, NY, 291–303.
[43]
Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W. Keckler, and Joel Emer. 2019. Timeloop: A systematic approach to DNN accelerator evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’19), Madison, WI, USA. 304–315.
[44]
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA’17), Toronto, ON, Canada. ACM, 27–40.
[45]
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’13), Seattle, Washington, USA. ACM, 519–530.
[46]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. 2014. ImageNet large scale visual recognition challenge. CoRR abs/1409.0575 (2014).
[47]
Yongming Shen, Michael Ferdman, and Peter Milder. 2017. Escher: A CNN accelerator with flexible buffering to minimize off-chip transfer. In 25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM’17), Napa, CA, USA. IEEE, 93–100.
[48]
Connor Shorten and Taghi M. Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of Big Data 6, 1 (2019), 1–48.
[49]
Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. 2019. HyPar: Towards hybrid parallelism for deep learning accelerator array. CoRR abs/1901.02067 (2019). arxiv:1901.02067.
[50]
Saurabh Tewari, Anshul Kumar, and Kolin Paul. 2020. Bus width aware off-chip memory access minimization for CNN accelerators. In 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI’20), Limassol, Cyprus. IEEE, 240–245.
[51]
Xueying Wang, Guangli Li, Xiao Dong, Jiansong Li, Lei Liu, and Xiaobing Feng. 2020. Accelerating deep learning inference with cross-layer data reuse on GPUs. In European Conference on Parallel Processing, Warsaw, Poland. Springer, 219–233.
[52]
Michael E. Wolf and Monica S. Lam. 1991. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation (PLDI’91), Toronto, Ontario, Canada. ACM, 30–44.
[53]
Jingling Xue. 2000. Loop Tiling for Parallelism. Kluwer Academic Publishers, Norwell, MA.
[54]
Fisher Yu and Vladlen Koltun. 2016. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR’16), San Juan, Puerto Rico.
[55]
M. A. Zainol and J. L. Nunez-Yanez. 2016. CPCIe: A compression-enabled PCIe core for energy and performance optimization. In 2016 IEEE Nordic Circuits and Systems Conference (NORCAS’16), Copenhagen, Denmark. IEEE, 1–6.
[56]
Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. Springer International Publishing, 818–833.
[57]
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’15), Monterey, California, USA. ACM, 161–170.
