We first quantize ResNets ranging from 20 to 56 layers deep to see how
Tailor’s accuracy fares under reduced precision. We then evaluate
Tailor’s effects on hardware resources and latency by performing a case study on ResNet-20-style skip connections implemented using the hls4ml architecture, i.e., the designs illustrated in Figure
2. We select this style of skip connection because it is the fundamental building block of ResNets that range from 20 to 110 layers. In our case study, we vary the bit precision and number of filters to see how
Tailor scales up. Based on how
Tailor’s resource reductions scale, designers can understand how
Tailor extrapolates to their own hardware designs. We report latency as well as P&R resource results on the Alveo U200 FPGA accelerator card (part no.
xcu200-fsgd2104-2-e). For end-to-end application results, we evaluate the benefits of
Tailor on two different styles of CNN architectures. The first uses the hls4ml tool to generate architectures. The second is the Reconfigurable DNN Engine, a 2D array of processing elements (PEs). Both styles of architecture are described in Section
3.1.
4.2.2 FPGA Evaluation.
Our first study looks solely at one ResNet block. The second study performs an end-to-end implementation of ResNet8 and ResNet50.
For our case study on a ResNet skip connection block (see designs in Figure
2), we evaluate
Tailor at
ap_fixed<8,3> and
ap_fixed<16,6> precisions using the hls4ml architecture. Under both bitwidths, we increase the number of filters for all designs from 16 to 32 to 64. This way, we can understand how
Tailor scales with the number of filters. We use hls4ml [
12] to translate these designs into Vivado HLS code, targeting the Alveo U200 FPGA accelerator card. hls4ml uses task-level pipelining (i.e., HLS dataflow) for each NN layer or a small group of layers and streams data between dataflow stages using FIFOs. hls4ml also exposes a knob known as a
reuse factor, which determines how often multipliers are reused in a design. To fairly compare our designs as the number of filters increases, we fix the reuse factor to 576. We then synthesize our designs to report P&R resource utilization as well as co-simulation latency results. Lastly, we run the designs on the U200 to verify correctness.
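To make this dataflow structure concrete, the following is a minimal Vivado HLS sketch of a traditional residual block in the hls4ml style. The function names, feature-map size, and FIFO depth are hypothetical, and the convolution bodies are stand-ins for the layer code hls4ml actually generates:

```cpp
#include "ap_fixed.h"
#include "hls_stream.h"

typedef ap_fixed<8, 3> data_t;  // the 8-bit precision used in this case study
const int N = 32 * 32 * 16;     // hypothetical activations per feature map

// "Cloning" stage: copy the input into the main and skip paths.
void clone_stage(hls::stream<data_t> &in, hls::stream<data_t> &main_s,
                 hls::stream<data_t> &skip_s) {
    for (int i = 0; i < N; i++) {
        data_t v = in.read();
        main_s.write(v);
        skip_s.write(v);
    }
}

// Placeholder convolution stage; hls4ml generates the real layer body and
// instantiates (layer multiplications / reuse factor) parallel multipliers.
void conv_stage(hls::stream<data_t> &in, hls::stream<data_t> &out) {
    for (int i = 0; i < N; i++)
        out.write(in.read());  // stand-in for conv + BN + ReLU
}

// Addition stage: element-wise add that rejoins the skip path.
void add_stage(hls::stream<data_t> &main_s, hls::stream<data_t> &skip_s,
               hls::stream<data_t> &out) {
    for (int i = 0; i < N; i++)
        out.write(main_s.read() + skip_s.read());
}

// Traditional residual block: the skip FIFO must buffer activations across
// two convolution stages, so it must be deep.
void resnet_block(hls::stream<data_t> &in, hls::stream<data_t> &out) {
#pragma HLS DATAFLOW
    hls::stream<data_t> main_s, skip_s, c1, c2;
#pragma HLS STREAM variable=skip_s depth=16384  // hypothetical depth = N
    clone_stage(in, main_s, skip_s);
    conv_stage(main_s, c1);
    conv_stage(c1, c2);
    add_stage(c2, skip_s, out);
}
```

In this structure, SkipRemover deletes clone_stage, add_stage, and the skip FIFO outright, while SkipShortener restructures the block so that the skip path begins and ends within a single dataflow stage.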
Under 8-bit precision, we find that both
SkipRemover and
SkipShortener reduce resources. Table
4 summarizes our P&R results. Since our designs use 8-bit precision, all of them exhibit low DSP usage and higher LUT and FF utilization. This is because Vivado HLS maps multiplications on datatypes narrower than 10 bits to LUTs instead of DSPs, as noted by [
2,
48]. It is possible to pack two 8-bit weights into a DSP [
13], but this is out of scope and orthogonal to the effects
Tailor has on hardware. Furthermore, all of the traditional and
Tailor designs use the same number of BRAMs for a given number of filters because here the BRAMs are used solely for on-chip weight storage, which does not differ across designs. Nonetheless,
SkipRemover decreases LUT usage by up to 16% and FF usage by up to 11% compared with the traditional design (Figure
10). These resource savings correspond to the extra hardware otherwise needed to implement a skip connection. As previously mentioned in Section
3.1, the extra dataflow stages that carry out a skip connection are no longer necessary. More importantly,
SkipRemover’s savings scale linearly as the number of filters increases from 16 to 64 (Figure
9).
SkipShortener’s resource reductions present a trade-off, increasing FFs by 2% in exchange for decreasing LUTs by 3% (Figure
10).
SkipShortener lowers LUT utilization because the lifespan of each skip connection lasts only one dataflow stage instead of the traditional two. This means we need not spend extra logic on the dataflow stages needed to copy the skip connections to buffers that last longer than one stage. However, since the shortened skip connection now fully resides in a single dataflow stage (previously described in Figure
2(c)), this requires some extra FFs. This represents the trade-off that
SkipShortener provides at 8-bit precision: some extra FFs for fewer LUTs. These resource trade-offs also scale linearly as the number of filters scales up, as seen in Figure
9.
We find more dramatic resource reductions when we look at our 16-bit designs, as seen in Figure
12. Table
5 summarizes our P&R results. In contrast with our 8-bit designs, at higher precision, our designs rely more on DSPs and BRAMs. This time the BRAMs are used not only to store weights on chip but also to implement the FIFOs that connect the dataflow stages. Therefore, as we tailor the dataflow stages according to each design (e.g.,
SkipRemover or
SkipShortener), the BRAMs now also reflect these changes. At its best,
SkipRemover lowers LUTs by 11%, FFs by 13%, and BRAMs by 13%. Without a skip connection to implement,
SkipRemover uses fewer resources than the traditional design. DSP usage remains unchanged because the DSPs are used solely for the convolutional layers’ multiplications and not the skip connection, which is also the case for
SkipShortener.
Similar to the 8-bit designs,
SkipShortener presents a resource trade-off—this time trading a small increase in LUTs (at most 1%) for decreases in FFs and BRAMs. In the best case,
SkipShortener reduces LUTs by 1%, FFs by 4%, and BRAMs by 34%. While
SkipShortener uses fewer LUTs than the traditional case for 32 filters,
SkipShortener pays about a 1% increase in LUTs for 16 and 64 filters in exchange for decreases in FFs and BRAMs. This small disparity is likely an artifact of the heuristics Vivado P&R uses to allocate resources. Again, these resource trade-offs and savings are possible because the shortened skip connections can be implemented within a single dataflow stage due to their reduced lifetimes. Table
6 shows that the lifetime of each shortened skip connection is a little less than half that of the traditional one. With these shorter lifetimes, the FIFOs for
SkipShortener’s skip connections can be implemented using shift registers instead of the BRAMs that the traditional design still uses (Table
6). Shift registers are much more efficient memories than BRAMs, so it is advantageous for hardware designers to consider how
SkipShortener creates the opportunity to implement skip connections with this more efficient memory architecture. This leads to 30–34% fewer BRAMs than the traditional design, even as the number of filters scales up. While in this case
SkipShortener uses fewer BRAMs than
SkipRemover, it offsets this difference by using more FFs. For both
SkipRemover and
SkipShortener, resource utilization (and the associated reductions) scales linearly, as seen in Figure
11.
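The tools make the FIFO implementation choice based on the FIFO’s depth and width, but designers can also pin it down explicitly. The following is a minimal Vivado HLS sketch with hypothetical depths; FIFO_BRAM and FIFO_SRL are, to our understanding, the resource cores for BRAM- and shift-register-backed FIFOs:

```cpp
#include "ap_fixed.h"
#include "hls_stream.h"

typedef ap_fixed<16, 6> data_t;

// Illustrates pragma placement only; depths are hypothetical.
void skip_fifo_sketch(hls::stream<data_t> &in, hls::stream<data_t> &out) {
#pragma HLS DATAFLOW
    // Traditional skip path: must outlive two dataflow stages, so the FIFO
    // is deep and lands in BRAM.
    hls::stream<data_t> skip_traditional;
#pragma HLS STREAM variable=skip_traditional depth=16384
#pragma HLS RESOURCE variable=skip_traditional core=FIFO_BRAM

    // Shortened skip path: alive a little less than half as long (Table 6),
    // so a much shallower FIFO suffices and maps to SRL shift registers.
    hls::stream<data_t> skip_shortened;
#pragma HLS STREAM variable=skip_shortened depth=128
#pragma HLS RESOURCE variable=skip_shortened core=FIFO_SRL

    // Trivial producer/consumer stages so the sketch is complete.
    for (int i = 0; i < 128; i++) skip_shortened.write(in.read());
    for (int i = 0; i < 128; i++) skip_traditional.write(skip_shortened.read());
    for (int i = 0; i < 128; i++) out.write(skip_traditional.read());
}
```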
Tailor does not affect latency for hls4ml architectures. As seen in Table
7, for each number of filters, all designs exhibit the same latency according to co-simulation on an Alveo U200. The slight decrease in latency as the number of filters scales is due to an increase in DSPs and a higher degree of parallelism. As discussed in Section
3.1, hls4ml designs pipeline their tasks. The convolutions’ multiplication tasks dominate the overall dataflow latency. The tasks that
SkipRemover eliminates and
SkipShortener implements more efficiently — the skip connection cloning and addition stages — have significantly lower latency than the convolutions and are thus not on the critical path. Therefore, the throughput remains the same.
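This follows directly from the structure of a task-level pipeline: its steady-state throughput is set by the slowest stage, \[ T_{\text{block}} \approx \max_i T_i, \] where \(T_i\) is the latency of dataflow stage \(i\). The cloning and addition stages have \(T_i\) far below that of the convolution stages, so removing them (SkipRemover) or restructuring them (SkipShortener) leaves the maximum, and hence the throughput, unchanged.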
By shortening skip connections, we reduce their lifespans, which provides an opportunity for simplifying their hardware implementation specifically for hls4ml architectures. However, shortening skip connections is not beneficial for all architectures. As seen in Table
8, shortening skip connections is worse for both GPUs and CPUs because doing so increases off-chip memory accesses. These extra accesses lower throughput by 5% on GPUs and 2% on CPUs. On FPGAs with hls4ml architectures, however, we can modify the architecture to take advantage of shortened skip connections, reducing resource consumption without negatively affecting throughput (Table
8).
We conducted two studies to understand how
Tailor performs for end-to-end implementations of ResNet models. The first is ResNet8 from MLPerf Tiny, which was designed in hls4ml [
3,
5]. The second is ResNet50, implemented on the Reconfigurable DNN architecture.
The ResNet8 model targets the Alveo U200. It uses a 16-bit fixed-point representation with six integer bits. The reuse factor for the layers was hand-tuned to 72, which directly affects the layers’ resource usage and latency. The reuse factor is one of the more important knobs for design space exploration in hls4ml and is often hand-tuned to make full use of the platform’s resources while optimizing overall network performance.
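Roughly speaking, a layer that requires \(M\) multiplications is implemented in hls4ml with \[ N_{\text{mult}} = \frac{M}{R} \] parallel multipliers, each reused \(R\) times per inference (here \(R = 72\)); lowering \(R\) buys lower layer latency at the cost of more parallel hardware, and raising \(R\) does the opposite.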
Table
9 shows the resource usage results for the ResNet8 model with skip connections, with shortened skip connections, and without skip connections. Removing the skip connections has clear benefits across all the resources. Shortening the skip connections reduces BRAMs while increasing LUTs and FFs. Both the shortened skip connection and the removed skip connection models show improved accuracy over traditional skip connections. In all cases, the latency remains the same, requiring 304,697 cycles running at 100 MHz (approximately 3 ms/inference).
Our second full-model case study implemented the Reconfigurable DNN architecture on the ZCU102 development board, which contains a Zynq UltraScale+ MPSoC. The Reconfigurable DNN array is configured with 7 rows × 96 columns for a total of 672 PEs that support 8-bit inputs and 8-bit weights. Each PE contains a multiplier and an accumulator implemented using DSPs on the FPGA fabric. Input pixels and weights are streamed into the engine as AXI-Stream packets. Images are processed in batches of 7 to increase data reuse and reduce memory accesses. The Reconfigurable DNN architecture was synthesized, placed, and routed at a clock frequency of 250 MHz on a ZCU102. The architecture with \(7 \times 96 = 672\) PEs used 49,057 LUTs (18%), 81,446 flip-flops (15%), 114 BRAMs (13%), and 1344 DSPs (53%) on the FPGA fabric.
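As a concrete sketch of one such PE (the accumulator width and loop bound are our assumptions, and hls::stream stands in for the engine’s AXI-Stream interfaces):

```cpp
#include "ap_int.h"
#include "hls_stream.h"

typedef ap_int<8>  pix_t;  // 8-bit input pixel, per the engine's configuration
typedef ap_int<8>  wgt_t;  // 8-bit weight, per the engine's configuration
typedef ap_int<32> acc_t;  // accumulator width is an assumption

// One processing element: multiply an incoming pixel by an incoming weight
// and accumulate; the array replicates 7 x 96 = 672 of these. The 8x8
// multiply-accumulate is the operation mapped onto the fabric's DSPs.
acc_t pe_mac(hls::stream<pix_t> &pixels, hls::stream<wgt_t> &weights, int n) {
    acc_t acc = 0;
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        acc += pixels.read() * weights.read();
    }
    return acc;
}
```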
We implemented a ResNet50 model with and without skip connections on a 672-element Reconfigurable DNN architecture running on the ZCU102. Table
10 shows the performance of ResNet50. Removing the skip connections largely benefits performance because the
\(1 \times 1\) convolution blocks on the skip paths are removed along with them, so those layers no longer need to be scheduled on the PE array. The result is much better performance across all metrics: an approximately 30% improvement in both frames per second (FPS) and latency, and an approximately 45% reduction in memory accesses.