We first quantize ResNets ranging from 20 to 56 layers deep to see how
Tailor’s accuracy fares under reduced precision. We then evaluate
Tailor’s effects on hardware resources and latency by performing a case study on ResNet-20-style skip connections implemented using the hls4ml architecture, i.e., the designs illustrated in Figure
2. We select this style of skip connection because it is the fundamental building block of ResNets that range from 20 to 110 layers. In our case study, we vary the bit precision and number of filters to see how
Tailor scales up. Based on how
Tailor’s resource reductions scale, designers can understand how
Tailor extrapolates to their own hardware designs. We report latency as well as P&R resource results on the Alveo U200 FPGA accelerator card (part no.
xcu200-fsgd2104-2-e). For end-to-end application results, we evaluate the benefits of
Tailor on two different styles of CNN architectures. The first uses the hls4ml tool to generate architectures. The second is the Reconfigurable DNN Engine, a 2D array of processing elements (PEs). Both styles of architecture are described in Section
3.1.
4.2.2 FPGA Evaluation.
Our first study looks solely at one ResNet block. The second study performs an end-to-end implementation of ResNet8 and ResNet50.
For our case study on a ResNet skip connection block (see designs in Figure
2), we evaluate
Tailor at
ap_fixed<8,3> and
ap_fixed<16,6> precisions using the hls4ml architecture. Under both bitwidths, we increase the number of filters for all designs from 16 to 32 to 64. This way, we can understand how
Tailor scales with the number of filters. We use hls4ml [
12] to translate these designs into Vivado HLS code, targeting the Alveo U200 FPGA accelerator card. hls4ml uses task-level pipelining (i.e., HLS dataflow) for each NN layer or a small group of layers and streams data between dataflow stages using FIFOs. hls4ml also exposes a knob known as a
reuse factor, which determines how often multipliers are reused in a design. To fairly compare our designs as the number of filters increases, we fix the reuse factor to 576. We then synthesize our designs to report P&R resource utilization as well as co-simulation latency results. Lastly, we run the designs on the U200 to verify correctness.
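To make this dataflow structure concrete, the following is a minimal Vivado HLS sketch of a traditional residual block in the hls4ml style. The function names, feature-map size, and FIFO depth are hypothetical, and the convolution bodies are stand-ins for the layer code hls4ml actually generates:

```cpp
#include "ap_fixed.h"
#include "hls_stream.h"

typedef ap_fixed<8, 3> data_t;  // the 8-bit precision used in this case study
const int N = 32 * 32 * 16;     // hypothetical activations per feature map

// "Cloning" stage: copy the input into the main and skip paths.
void clone_stage(hls::stream<data_t> &in, hls::stream<data_t> &main_s,
                 hls::stream<data_t> &skip_s) {
    for (int i = 0; i < N; i++) {
        data_t v = in.read();
        main_s.write(v);
        skip_s.write(v);
    }
}

// Placeholder convolution stage; hls4ml generates the real layer body and
// instantiates (layer multiplications / reuse factor) parallel multipliers.
void conv_stage(hls::stream<data_t> &in, hls::stream<data_t> &out) {
    for (int i = 0; i < N; i++)
        out.write(in.read());  // stand-in for conv + BN + ReLU
}

// Addition stage: element-wise add that rejoins the skip path.
void add_stage(hls::stream<data_t> &main_s, hls::stream<data_t> &skip_s,
               hls::stream<data_t> &out) {
    for (int i = 0; i < N; i++)
        out.write(main_s.read() + skip_s.read());
}

// Traditional residual block: the skip FIFO must buffer activations across
// two convolution stages, so it must be deep.
void resnet_block(hls::stream<data_t> &in, hls::stream<data_t> &out) {
#pragma HLS DATAFLOW
    hls::stream<data_t> main_s, skip_s, c1, c2;
#pragma HLS STREAM variable=skip_s depth=16384  // hypothetical depth = N
    clone_stage(in, main_s, skip_s);
    conv_stage(main_s, c1);
    conv_stage(c1, c2);
    add_stage(c2, skip_s, out);
}
```

In this structure, SkipRemover deletes clone_stage, add_stage, and the skip FIFO outright, while SkipShortener restructures the block so that the skip path begins and ends within a single dataflow stage.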
Under 8-bit precision, we find that both
SkipRemover and
SkipShortener reduce resources. Table
4 summarizes our P&R results. Since our designs use 8-bit precision, all of them exhibit low DSP usage and higher LUT and FF utilization. This is because Vivado HLS maps multiplications on datatypes narrower than 10 bits to LUTs instead of DSPs, as noted by [
2,
48]. It is possible to pack two 8-bit weights into a DSP [
13], but this is out of scope and orthogonal to the effects
Tailor has on hardware. Furthermore, all of the traditional and
Tailor designs use the same number of BRAMs for a given number of filters because here the BRAMs are used solely for on-chip weight storage, which does not differ across designs. Nonetheless,
SkipRemover decreases LUT usage by up to 16% and FF usage by up to 11% compared with the traditional design (Figure
10). These resource savings correspond to the extra hardware otherwise needed to implement a skip connection. As previously mentioned in Section
3.1, the extra dataflow stages that carry out a skip connection are no longer necessary. More importantly,
SkipRemover’s savings scale linearly as the number of filters increases from 16 to 64 (Figure
9).
SkipShortener’s resource reductions present a trade-off, increasing FFs by 2% in exchange for decreasing LUTs by 3% (Figure
10).
SkipShortener lowers LUT utilization because the lifespan of each skip connection lasts only one dataflow stage instead of the traditional two. This means we need not spend extra logic on the dataflow stages needed to copy the skip connections to buffers that last longer than one stage. However, since the shortened skip connection now fully resides in a single dataflow stage (previously described in Figure
2(c)), this requires some extra FFs. This represents the trade-off that
SkipShortener provides at 8-bit precision: some extra FFs for fewer LUTs. These resource trade-offs also scale linearly as the number of filters scales up, as seen in Figure
9.
We find more dramatic resource reductions when we look at our 16-bit designs, as seen in Figure
12. Table
5 summarizes our P&R results. In contrast with our 8-bit designs, at higher precision, our designs rely more on DSPs and BRAMs. This time the BRAMs are used not only to store weights on chip but also to implement the FIFOs that connect the dataflow stages. Therefore, as we tailor the dataflow stages according to each design (e.g.,
SkipRemover or
SkipShortener), the BRAMs now also reflect these changes. At its best,
SkipRemover lowers LUTs by 11%, FFs by 13%, and BRAMs by 13%. Without a skip connection to implement,
SkipRemover uses fewer resources than the traditional design. DSP usage remains unchanged because the DSPs are used solely for the convolutional layers’ multiplications and not the skip connection, which is also the case for
SkipShortener.
Similar to the 8-bit designs,
SkipShortener presents a resource trade-off—this time trading a small increase in LUTs (at most 1%) for decreases in FFs and BRAMs. In the best case,
SkipShortener reduces LUTs by 1%, FFs by 4%, and BRAMs by 34%. While
SkipShortener uses fewer LUTs than the traditional case for 32 filters,
SkipShortener pays about a 1% increase in LUTs for 16 and 64 filters in exchange for decreases in FFs and BRAMs. This small disparity is likely an artifact of the heuristics Vivado P&R uses to allocate resources. Again, these resource trade-offs and savings are possible because the shortened skip connections can be implemented within a single dataflow stage due to their reduced lifetimes. Table
6 shows that the lifetime of each shortened skip connection is a little less than half that of the traditional one. With these shorter lifetimes, the FIFOs for
SkipShortener’s skip connections can be implemented using shift registers instead of the BRAMs that the traditional design still uses (Table
6). Shift registers are much more efficient memories than BRAMs, so it is advantageous for hardware designers to consider how
SkipShortener creates the opportunity to implement skip connections with this more efficient memory architecture. This leads to 30–34% fewer BRAMs than the traditional design, even as the number of filters scales up. While in this case
SkipShortener uses fewer BRAMs than
SkipRemover, it offsets this difference by using more FFs. For both
SkipRemover and
SkipShortener, resource utilization (and the associated reductions) scales linearly, as seen in Figure
11.
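The tools make the FIFO implementation choice based on the FIFO’s depth and width, but designers can also pin it down explicitly. The following is a minimal Vivado HLS sketch with hypothetical depths; FIFO_BRAM and FIFO_SRL are, to our understanding, the resource cores for BRAM- and shift-register-backed FIFOs:

```cpp
#include "ap_fixed.h"
#include "hls_stream.h"

typedef ap_fixed<16, 6> data_t;

// Illustrates pragma placement only; depths are hypothetical.
void skip_fifo_sketch(hls::stream<data_t> &in, hls::stream<data_t> &out) {
#pragma HLS DATAFLOW
    // Traditional skip path: must outlive two dataflow stages, so the FIFO
    // is deep and lands in BRAM.
    hls::stream<data_t> skip_traditional;
#pragma HLS STREAM variable=skip_traditional depth=16384
#pragma HLS RESOURCE variable=skip_traditional core=FIFO_BRAM

    // Shortened skip path: alive a little less than half as long (Table 6),
    // so a much shallower FIFO suffices and maps to SRL shift registers.
    hls::stream<data_t> skip_shortened;
#pragma HLS STREAM variable=skip_shortened depth=128
#pragma HLS RESOURCE variable=skip_shortened core=FIFO_SRL

    // Trivial producer/consumer stages so the sketch is complete.
    for (int i = 0; i < 128; i++) skip_shortened.write(in.read());
    for (int i = 0; i < 128; i++) skip_traditional.write(skip_shortened.read());
    for (int i = 0; i < 128; i++) out.write(skip_traditional.read());
}
```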
Tailor does not affect latency for hls4ml architectures. As seen in Table
7, for each number of filters, all designs exhibit the same latency according to co-simulation on an Alveo U200. The slight decrease in latency as the number of filters scales is due to an increase in DSPs and a higher degree of parallelism. As discussed in Section
3.1, hls4ml designs pipeline their tasks. The convolutions’ multiplication tasks dominate the overall dataflow latency. The tasks that
SkipRemover eliminates and
SkipShortener implements more efficiently — the skip connection cloning and addition stages — have significantly lower latency than the convolutions and are thus not on the critical path. Therefore, the throughput remains the same.
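This follows directly from the structure of a task-level pipeline: its steady-state throughput is set by the slowest stage, \[ T_{\text{block}} \approx \max_i T_i, \] where \(T_i\) is the latency of dataflow stage \(i\). The cloning and addition stages have \(T_i\) far below that of the convolution stages, so removing them (SkipRemover) or restructuring them (SkipShortener) leaves the maximum, and hence the throughput, unchanged.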
By shortening skip connections, we reduce their lifespans, which provides an opportunity for simplifying their hardware implementation specifically for hls4ml architectures. However, shortening skip connections is not beneficial for all architectures. As seen in Table
8, shortening skip connections is worse for both GPUs and CPUs because doing so increases off-chip memory accesses. These extra accesses lower throughput by 5% on GPUs and 2% on CPUs. On FPGAs with hls4ml architectures, however, we can modify the architecture to take advantage of shortened skip connections, reducing resource consumption without negatively affecting throughput (Table
8).
We conducted two studies to understand how
Tailor performs for end-to-end implementations of ResNet models. The first is ResNet8 from MLPerf Tiny, which was designed in hls4ml [
3,
5]. The second is ResNet50, implemented on the Reconfigurable DNN architecture.
The ResNet8 model targets the Alveo U200. It uses a 16-bit fixed-point representation with six integer bits. The reuse factor for the layers was hand-tuned to 72, which directly affects the layers’ resource usage and latency. The reuse factor is one of the more important knobs for design space exploration in hls4ml and is often hand-tuned to make full use of the platform’s resources while optimizing overall network performance.
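Roughly speaking, a layer that requires \(M\) multiplications is implemented in hls4ml with \[ N_{\text{mult}} = \frac{M}{R} \] parallel multipliers, each reused \(R\) times per inference (here \(R = 72\)); lowering \(R\) buys lower layer latency at the cost of more parallel hardware, and raising \(R\) does the opposite.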
Table
9 shows the resource usage results for the ResNet8 model with skip connections, with shortened skip connections, and without skip connections. Removing the skip connections has clear benefits across all the resources. Shortening the skip connections reduces BRAMs while increasing LUTs and FFs. Both the shortened skip connection and the removed skip connection models show improved accuracy over traditional skip connections. In all cases, the latency remains the same, requiring 304,697 cycles running at 100 MHz (approximately 3 ms/inference).
Our second full-model case study implemented the Reconfigurable DNN architecture on the ZCU102 development board, which contains a Zynq UltraScale+ MPSoC. The Reconfigurable DNN array is configured with 7 rows × 96 columns for a total of 672 PEs that support 8-bit inputs and 8-bit weights. Each PE contains a multiplier and an accumulator implemented using DSPs on the FPGA fabric. Input pixels and weights are streamed into the engine as AXI-Stream packets. Images are processed in batches of 7 to increase data reuse and reduce memory accesses. The Reconfigurable DNN architecture was synthesized, placed, and routed at a clock frequency of 250 MHz on a ZCU102. The architecture with \(7 \times 96 = 672\) PEs used 49,057 LUTs (18%), 81,446 flip-flops (15%), 114 BRAMs (13%), and 1344 DSPs (53%) on the FPGA fabric.
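As a concrete sketch of one such PE (the accumulator width and loop bound are our assumptions, and hls::stream stands in for the engine’s AXI-Stream interfaces):

```cpp
#include "ap_int.h"
#include "hls_stream.h"

typedef ap_int<8>  pix_t;  // 8-bit input pixel, per the engine's configuration
typedef ap_int<8>  wgt_t;  // 8-bit weight, per the engine's configuration
typedef ap_int<32> acc_t;  // accumulator width is an assumption

// One processing element: multiply an incoming pixel by an incoming weight
// and accumulate; the array replicates 7 x 96 = 672 of these. The 8x8
// multiply-accumulate is the operation mapped onto the fabric's DSPs.
acc_t pe_mac(hls::stream<pix_t> &pixels, hls::stream<wgt_t> &weights, int n) {
    acc_t acc = 0;
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        acc += pixels.read() * weights.read();
    }
    return acc;
}
```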
We implemented a ResNet50 model with and without skip connections on a 672-element Reconfigurable DNN architecture running on the ZCU102. Table
10 shows the performance of ResNet50. Removing the skip connections largely benefits performance because the
\(1 \times 1\) convolution blocks on the skip paths are removed along with them, so those layers no longer need to be scheduled on the PE array. The result is much better performance across all metrics: an approximately 30% improvement in both frames per second (FPS) and latency, and an approximately 45% reduction in memory accesses.