# FireFly: A High-Throughput and Reconfigurable Hardware Accelerator for Spiking Neural Networks

Jindong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang, Yi Zeng

Abstract: Spiking neural networks (SNNs) have been widely used due to their strong biological interpretability and high energy efficiency. With the introduction of the backpropagation algorithm and surrogate gradient, the structure of spiking neural networks has become more complex, and the performance gap with artificial neural networks has gradually decreased. However, most SNN hardware implementations for field-programmable gate arrays (FPGAs) cannot meet arithmetic or memory efficiency requirements, which significantly restricts the development of SNNs. They either do not delve into the arithmetic operations between the binary spikes and synaptic weights or assume unlimited on-chip RAM resources by using overly expensive devices on small tasks. To improve arithmetic efficiency, we analyze the neural dynamics of spiking neurons, generalize the SNN arithmetic operation to the multiplex-accumulate operation, and propose a high-performance implementation of this operation by utilizing the DSP48E2 hard block in Xilinx Ultrascale FPGAs. To improve memory efficiency, we design a memory system that enables efficient access to synaptic weights and membrane voltage with reasonable on-chip RAM consumption. Combining the above two improvements, we propose an FPGA accelerator that can process spikes generated by the firing neuron on-the-fly (FireFly). FireFly is implemented on several FPGA edge devices with limited resources but still guarantees a peak performance of 5.53 TSOP/s at 300 MHz. As a lightweight accelerator, FireFly achieves the highest computational density efficiency compared with existing research using large FPGA devices.

Index Terms—Spiking Neural Networks, Field-programmable gate array, Hardware Accelerator

#### I. INTRODUCTION

SPIKING neural networks (SNNs) are considered the third generation of artificial neural networks (ANNs) [1]. They were developed to mimic the operational mechanism in the human brain, where information is communicated via spikes among neurons. Surrogate gradient algorithms have been introduced to tackle the nondifferentiability problem in SNNs and enhance their learning capability [2], [3]. Recent advances in SNNs have demonstrated comparable performance to non-spiking ANNs [4]–[8]. However, compared to the extensive work on ANN accelerators [9]–[11], existing SNN hardware accelerators still lag, limiting the practical applications of SNNs.

Manuscript created January 1, 2023. This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB32070100). (Corresponding authors: Qian Zhang; Yi Zeng.)

Jindong Li and Qian Zhang are with the Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: lijindong2022@ia.ac.cn, q.zhang@ia.ac.cn).

Guobin Shen is with the Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: shenguobin2021@ia.ac.cn).

Dongcheng Zhao is with the Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: zhaodongcheng2016@ia.ac.cn).

Yi Zeng is with the Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and University of Chinese Academy of Sciences, Beijing 100049, China, and Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai 200031, China (e-mail: yi.zeng@ia.ac.cn).


Most research ignores the importance of efficiently implementing arithmetic operations in SNN accelerators. In field-programmable gate array (FPGA) design, using the built-in dedicated hard block to implement arithmetic operations can achieve considerably higher performance than its general logic fabric counterparts. Fabric-only implementations in an arithmetic-intensive application can lead to a compromised clock frequency and even routing failures when the fabric consumption is high. However, in SNN accelerator design, the register transfer level (RTL) description of the SNN arithmetic operation cannot be automatically synthesized into the dedicated arithmetic hard block. Therefore, most SNN accelerators adopt the fabric-only implementation without further optimizations. Although a single arithmetic operation unit in an SNN accelerator consumes considerably fewer resources than a multiply-accumulate (MAC) unit in an ANN accelerator design, hardware optimization of this operation can still significantly impact the system's performance when the unit is instantiated hundreds or even thousands of times. In the Xilinx Ultrascale FPGA, the dedicated arithmetic hard block, the DSP48E2, enhances the speed and efficiency of many operations, including multiplication, addition, wide bus multiplexing, pattern detection, and single instruction multiple data (SIMD) operations. It is possible to generalize the SNN computation to the arithmetic operations that the DSP48E2 can provide.

Another important aspect of SNN accelerator design is the memory system. When scaling the parallelism, the memory bandwidth imbalance between the binary input-output spikes, the multi-bit synaptic weights, and the multi-bit membrane voltage becomes problematic. While the computational complexity and the memory footprint of the binary spikes decrease, the memory access requirements of synaptic weights and membrane voltage do not. Without further exploration of the reuse mechanism, the off-chip memory access bandwidth needed by the weights and membrane voltage cannot fully support the increased parallelism brought by the hardware-friendly synaptic operations and storage-friendly binary spikes. Most hardware accelerators assume large on-chip memory, store all the synaptic weights, and accumulate membrane voltage on-chip to ease the harsh bandwidth requirement. This method is not scalable, especially when the model gets larger and targets edge FPGA devices. A scalable memory system for synaptic weights and membrane voltage that balances off-chip data access and on-chip data buffering should be developed.

At present, most existing neuromorphic hardware or accelerators focus on brain simulation tasks. While these hardware designs claim to support event-driven processing, they are inefficient in terms of resource utilization, computational density, and scalability. In real-world SNN applications, it is not feasible to use overly expensive and large FPGA devices. A lightweight and high-performance SNN accelerator targeting resource-constrained edge scenarios should be developed.

Focusing on these aspects, we propose FireFly, a high-throughput and reconfigurable FPGA accelerator that can achieve both arithmetic and memory efficiency. Our contributions can be summarized as follows.

- 1) We generalize the SNN arithmetic operation to the multiplex-accumulate operation and propose a high-performance implementation of such an operation by utilizing the DSP48E2 hard block in Xilinx Ultrascale FPGAs.
- 2) We design a synaptic weight delivery hierarchy and a partial sum and membrane voltage (Psum-Vmem) unified buffer to balance the off-chip memory access bandwidth and on-chip RAM consumption.
- 3) We evaluate multiple deep SNN models on various datasets and achieve faster inference speed and higher classification accuracy than the existing research. We implement FireFly on several commercial off-the-shelf FPGA edge devices with limited resources, bringing hope for real-world SNN applications in edge scenarios.

#### II. RELATED WORK

The existing dedicated neuromorphic hardware designed for SNN can be categorized into three types.

The majority of neuromorphic hardware constructs its hardware substrates in a Network-on-Chip (NoC) fashion. Loihi [12], Tianji Chip [13], SpiNNaker [14] and TrueNorth [15] fall into this category. In these hardware designs, neurons are grouped into multiple neurocores, which communicate via spikes through the NoC, and spike messages are scheduled by dedicated routers. These hardware architectures are compatible with the event-driven nature of SNNs, as spike events are generated, transferred, and processed only if the neuron fires. However, these neuromorphic hardware designs place rigid restrictions on the network. The SNN models are distributed among the neurocores, and the total number of neurons in the model cannot exceed the maximum capacity of the hardware, not to mention the harsh fan-in and fan-out hardware limitations of the network.

The second type of neuromorphic hardware explores emerging devices. BrainScaleS [16], developed by Heidelberg University, emulated spiking neural networks on analog neuromorphic hardware and achieved several advantages over conventional computers. Some research explores new materials like memristors and optics [17]–[19]. However, the low precision and uncertain nature of the hardware prevent them from being used in practice.

The third type of neuromorphic hardware follows the scheme of ANN accelerator design, except that it constructs dedicated hardware for synaptic operations and explores dataflows optimized specifically for SNNs [20]–[26]. This type of work requires less area and achieves higher computing resource utilization. Fine-grained parallelism of the accelerator can enable high-performance computing of the SNN compared with the sequential spike processing mechanism of the NoC counterparts. This type of hardware has the fewest restrictions on the network models and can quickly adapt to emerging neuromorphic research. FPGA platforms are the ideal choice for this type of hardware due to their flexibility and reconfigurability.

While FireFly belongs to this category, its contributions are largely complementary to the existing work.

SyncNN [21] proposed a novel synchronous event-driven SNN reconfigurable inference engine and evaluated multiple SNN models on multiple FPGA devices. Fang et al. [27] proposed a holistic optimization framework for the encoder, model, and architecture design of FPGA-based neuromorphic hardware. However, these designs are based on high-level synthesis, thus inducing large resource redundancy.

Lee et al. [28], [29] and Chen et al. [30] explored spatial-temporal parallelism by unrolling the computations in both the spatial and time dimensions and achieved significant acceleration. However, parallelization across multiple time points violates the time-related sequential nature of the membrane voltage update behavior.

SpinalFlow [25] achieved significant sparsity acceleration by adopting a different input/output spike representation to skip the non-spike computations. SATO [31] achieved high-speed inference by incorporating a temporal-oriented dataflow and a bucket-sort-based dispatcher to balance the workload. However, these techniques only work for temporal coding SNNs, limiting the accuracy of the SNN models.

DeepFire [23] was the first work to bring DSP48E2s into the neuron core design. However, it did not delve into the functions of the DSP48E2 and still incurs a large fabric overhead.

We argue that with careful register transfer level (RTL) design, focusing on optimizing spatial parallelism on FPGA, adopting regular and simple time-step CNN-like processing, and fully utilizing the multi-function DSP48E2, we can still achieve impressive inference throughput on small FPGA edge devices. FireFly is more applicable in real-world applications where design space exploration is constrained by limited resources.

#### III. SNN BASICS

#### A. Spiking Neuron Model

Spiking neurons are the basic units of SNNs, which are connected through weighted synapses and transmit information through binary spikes. Although more complex and detailed neuron models such as Izhikevich [32] and Hodgkin–Huxley [33] can accurately model a biological neuron's behavior, simpler models such as Integrate and Fire (IF) [34] and Leaky Integrate and Fire (LIF) [35] are used more often in current SNN applications.

An IF neuron integrates its inputs over multiple timesteps and generates a spike whenever the integrated membrane voltage surpasses a firing threshold. A LIF neuron acts the same except for the leaky behavior of the membrane voltage. The neural dynamics of the membrane potential $u$ of a LIF neuron can be described as:

$$\tau_m \frac{du}{dt} = -u + R \cdot I(t), \quad u < V_{th} \tag{1}$$

where $V_{th}$ denotes the threshold, $I$ denotes the input current, $R$ denotes the resistance, and $\tau_m$ is the membrane time constant. A spike is generated when $u$ reaches $V_{th}$, and $u$ is reset to the resting potential $u_{rest}$, which is set to 0 in this work. The membrane potential's neural dynamics can be divided into three phases, and each phase can be described in a discrete computational form:

Input current integration phase. All the presynaptic currents generated by the presynaptic spikes are integrated at each discrete timestep.

$$I[t] = \sum_{j} w_{ij} s_j[t] + b_i \tag{2}$$

where the subscript $i$ represents the $i$-th neuron, $w_{ij}$ is the synaptic weight from neuron $j$ to neuron $i$, and $b_i$ is a bias.

Membrane potential update phase. The membrane potential of each neuron is updated by the integrated presynaptic currents at each timestep.

$$v_i[t] = (1 - \frac{1}{\tau_m})u_i[t] + I[t] \tag{3}$$

where  $(1 - \frac{1}{\tau_m}) < 1$  denotes the leaky term, which is ignored when using the IF model.

Output spike generation phase. Whenever the membrane potential reaches the firing threshold, the neuron generates an output spike and resets its membrane potential.

$$(u_i[t+1], s_i[t+1]) = \begin{cases} (v_i[t], 0), & v_i[t] < V_{th} \\ (0, 1), & v_i[t] \ge V_{th} \end{cases} \tag{4}$$

In these three phases, we have two key observations. The input current integration phase completely dominates the total computational cost due to the high degree of synaptic connectivity and a large number of neurons. The membrane potential update phase has the harshest storage requirement because the membrane potential is read and written back and forth in every timestep. We will focus on these two aspects in the following sections.
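The three phases in Eqs. (2)-(4) can be sketched for one layer and one timestep as follows. This is a minimal software illustration, not the hardware implementation; the function name `lif_step` and the list-based data layout are assumptions made for the example.

```python
from typing import List, Tuple


def lif_step(u: List[float], spikes_in: List[int], w: List[List[float]],
             b: List[float], tau_m: float, v_th: float
             ) -> Tuple[List[float], List[int]]:
    """One discrete timestep for a layer of LIF neurons (Eqs. 2-4)."""
    u_next, s_out = [], []
    for i in range(len(u)):
        # Eq. (2): integrate presynaptic currents; a binary spike gates each weight.
        current = b[i] + sum(w[i][j] for j, s in enumerate(spikes_in) if s)
        # Eq. (3): leaky membrane potential update.
        v = (1.0 - 1.0 / tau_m) * u[i] + current
        # Eq. (4): fire and reset to 0, or carry the potential forward.
        if v >= v_th:
            u_next.append(0.0)
            s_out.append(1)
        else:
            u_next.append(v)
            s_out.append(0)
    return u_next, s_out


u, s = lif_step([0.0, 0.0], [1, 0, 1],
                [[0.5, 0.25, 0.75], [0.125, 1.0, 0.125]],
                [0.0, 0.0], tau_m=2.0, v_th=1.0)
print(u, s)
```

Passing `tau_m=float("inf")` makes the leaky term equal to 1, which corresponds to the IF model in this sketch.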

#### B. Dataflow and Parallelism Scheme for SCNN

Similar to convolutional neural networks (CNNs), convolutional layers dominate the total computational cost in spiking convolutional neural networks (SCNNs). We mainly focus on the dataflow optimizations of the convolutional layers and show that the dataflow can be migrated to fully connected layers.

Input/output spike representation varies in different neuromorphic hardware. Most SNN hardware implementations adopt the Address-Event-Representation (AER) data format to transmit spikes between neurons. The standard AER package for one spike includes the spiking neuron's input location and the spike's timestamp. Although the AER data format is compatible with the event-driven nature of SNNs, multiple bits are needed to express the original single-bit spike event. The logic and storage overhead may not be worth it.

Algorithm 1: Pseudo Code of FireFly Architecture.

**Input:** Given the binary spike map size $(H, W)$, input-output channels $(C_{in}, C_{out})$, kernel size $(K_h, K_w)$, total timestep $T$, leaky factor $\lambda$, threshold $V_{th}$ and parallelism factor $P$. Divide the input and output channels into $(c_i = \lceil \frac{C_{in}}{P} \rceil, c_o = \lceil \frac{C_{out}}{P} \rceil)$ groups.

**Input:** $T \times c_i$ fragments of $I[P][H \times W]$ streams; each stream passes the hardware $c_o$ times.

**Output:** $c_o \times T$ fragments of $O[P][H \times W]$ streams.

- Create buffer for synaptic weights: $W[P][C_{in}][K_h][K_w]$;
- Create buffer for Psum/Vmem: $V[P][H \times W]$;
- **for** $p_o \leftarrow 0$ **to** $c_o$ **do**
  - Load weights: $W[P][C_{in}][K_h][K_w]$;
  - **for** $t \leftarrow 0$ **to** $T$ **do**
    - **for** $p_i \leftarrow 0$ **to** $c_i$ **do**
      - **for** $s \leftarrow 0$ **to** $H \times W$ **do**
        - Unroll and pipeline;
        - **for** $o \leftarrow 0$ **to** $P$ **do**
          - **for** $i \leftarrow 0$ **to** $P$ **do**
            - $\boldsymbol{w} = W[o][p_i \times P + i][0 \to K_h][0 \to K_w]$;
            - $\boldsymbol{i} = \mathrm{neighbour}(I[i][s])$;
            - $V[o][s] \mathrel{+}= \boldsymbol{w} \cdot \boldsymbol{i}$;
        - **if** $p_i = c_i - 1$ **then**
          - $V[o][s] \mathrel{\times}= (1 - \lambda)$;
          - **if** $V[o][s] > V_{th}$ **then** $V[o][s] = 0$, $O[o][s] = 1$; **else** $O[o][s] = 0$;
          - **if** $t = T - 1$ **then** $V[o][s] = 0$;

This paper adopts the original single-bit format to represent the binary spikes. At any discrete timestep $t$ in the digitalized SCNN, the output spikes of all the neurons in one channel of the convolutional layer can be considered a timestep snapshot in the form of a binary map [36]. In this case, the input-current integration phase of an SNN is computed almost the same way as in a traditional ANN, except for the additional time dimension and the changed operation. The set of computations for the complete SNN convolutional layer that receives a single batch of input can be formulated as a loop nest over seven variables: the timestep, the input and output channels, the height and width of the spike map, and the height and width of the kernel. All permutations of the six loop variables other than the timestep are legal. Permutations of the loop variables open up the possibility of different dataflow choices. The tiling of the loop variables opens up the possibility of different parallelism schemes.

Fig. 1. FireFly Architecture.

Different permutations of the loop variables yield different kinds of dataflow. Different dataflow schemes for convolution have been extensively studied by Eyeriss [9]. The key consideration is how to minimize data movement and maximize data reuse. In SCNN, synaptic connection weights need to be fetched and membrane voltage needs to be updated at every timestep, due to the unique time dimension in SNN computation. Therefore, output and weight stationary dataflow can minimize the data movement of the multi-bit membrane voltage and synaptic weight data between on-chip logic and off-chip memory.

Different tiling strategies for the loop variables enable different parallelism schemes. The tiling of the loop variables can induce data reordering or data segmentation. We argue that it is important to keep the input and output spike arrangements the same to enable spikes to be processed in an on-the-fly fashion without complicated data rearrangement. We chose the spatial tiling of the input and output channel dimensions rather than tiling within the same spike feature map to avoid data rearranging or irregular off-chip data access.

Adopting the dataflow and parallelism scheme above, the pseudo-code of FireFly is described in Algorithm 1.
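As a software reference for the loop nest in Algorithm 1, the following sketch computes one spiking convolutional layer with the same loop order and channel tiling. It is a behavioral model only. The function name `firefly_layer`, the nested-list data layout, the zero "same" padding used for the neighbourhood, and the restriction to channel counts divisible by `P` are assumptions made for the example. The spatial index `s` is written as a `(y, x)` pair.

```python
from typing import List

Spikes = List[List[List[List[int]]]]   # [T][C][H][W], binary
Weights = List[List[List[List[int]]]]  # [C_out][C_in][K_h][K_w]


def firefly_layer(spk: Spikes, wgt: Weights, P: int, lam: float,
                  v_th: float) -> Spikes:
    T, c_in = len(spk), len(spk[0])
    H, W = len(spk[0][0]), len(spk[0][0][0])
    c_out, k_h, k_w = len(wgt), len(wgt[0][0]), len(wgt[0][0][0])
    assert c_in % P == 0 and c_out % P == 0  # simplifying assumption
    out = [[[[0] * W for _ in range(H)] for _ in range(c_out)]
           for _ in range(T)]
    for p_o in range(c_out // P):             # output-channel group
        # Psum/Vmem buffer of the current group, kept across timesteps.
        v = [[[0.0] * W for _ in range(H)] for _ in range(P)]
        for t in range(T):
            for p_i in range(c_in // P):      # input-channel group
                for y in range(H):
                    for x in range(W):
                        for o in range(P):    # unrolled in hardware
                            for i in range(P):
                                co, ci = p_o * P + o, p_i * P + i
                                for dy in range(k_h):
                                    for dx in range(k_w):
                                        yy = y + dy - k_h // 2
                                        xx = x + dx - k_w // 2
                                        inside = 0 <= yy < H and 0 <= xx < W
                                        # A spike selects the weight; no multiply.
                                        if inside and spk[t][ci][yy][xx]:
                                            v[o][y][x] += wgt[co][ci][dy][dx]
                        if p_i == c_in // P - 1:   # all input groups summed
                            for o in range(P):
                                v[o][y][x] *= (1.0 - lam)
                                fired = v[o][y][x] > v_th
                                out[t][p_o * P + o][y][x] = int(fired)
                                if fired or t == T - 1:
                                    v[o][y][x] = 0.0
    return out
```

Because the tiling only regroups channels, calling the function with different values of `P` should give identical output spikes, which is a convenient sanity check for the parallelism scheme.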

#### IV. HARDWARE ARCHITECTURE

#### A. Architecture Overview

In this section, the digital design of SNNs is discussed in detail. Fig.1 shows the overall system design of FireFly.

FireFly targets heterogeneous Zynq Ultrascale devices. The central processing unit (CPU) of the processing system (PS) acts as the controller for system state control and external memory access. The programmable logic (PL) accelerates the SNN inference.

AXI DataMover IP, instead of AXI DMA IP, enables high-throughput and low-latency data transactions between the off-chip DRAM and on-chip memory storage. The unique store and forward feature of AXI DataMover is enabled to allow multiple outstanding requests.

The weight-stationary systolic array is responsible for the acceleration of SNN arithmetic operations. The systolic array consists of several DSP48E2 chains and multiple adder trees. A weight matrix delivery hierarchy is proposed to enable efficient weight loading to the systolic array. Two separate datapaths for convolutional and fully connected layers are designed to generate binary spike vectors for the systolic array. A Psum-Vmem unified buffer and update engine is constructed to support back-and-forth membrane potential update and IF/LIF neuron dynamics. An optional MaxPooling unit is placed on the output spike datapath to support on-the-fly pooling.

The designs of the systolic array, the spike vector generation unit, the synaptic weight delivery hierarchy, and the Psum-Vmem unified buffer are elaborated in detail below.

#### B. Synaptic Operations Featured by DSP48E2

As shown in Fig. 2A, the DSP48E2 is the dedicated digital signal processing logic block in the Xilinx Ultrascale series FPGA. Most FPGA neuromorphic hardware simply treats DSP48E2 slices as multipliers and leaves them underutilized. However, they enhance the speed and efficiency of many applications far beyond multiplication-based digital signal processing [37].

When customizing arithmetic operations for the SNN model, the mathematical dot product operation between the binary spikes and the synaptic weights can be modeled as a multiplex-accumulate operation, which we call the synaptic operation in this paper. The spike acts like the control signal of the multiplexer, switching the synaptic weight on or off when the neuron is firing or resting. The following adder sums up all the synaptic weights coming from the firing neurons.
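The equivalence between the multiplex-accumulate operation and a dot product with binary spikes can be illustrated with a few lines of Python. The function names are hypothetical and chosen only for this example.

```python
from typing import Sequence


def synaptic_op(spikes: Sequence[int], weights: Sequence[int]) -> int:
    """Multiplex-accumulate: each spike selects its weight or 0, then sum."""
    acc = 0
    for s, w in zip(spikes, weights):
        acc += w if s else 0   # 2:1 multiplexer followed by an adder
    return acc


def dot_product(spikes: Sequence[int], weights: Sequence[int]) -> int:
    """Reference multiply-accumulate over the same operands."""
    return sum(s * w for s, w in zip(spikes, weights))


spikes, weights = [1, 0, 1, 1], [5, -3, 7, -2]
print(synaptic_op(spikes, weights), dot_product(spikes, weights))  # 10 10
```

No multiplier is involved in `synaptic_op`, which is why the operation maps onto multiplexers and adders rather than onto MAC units.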






Fig. 2. Implementing Synaptic Operations Using DSP48E2. A) The functional circuit diagram of a single DSP48E2 slice [37]. B) A simplified functional circuit diagram of the DSP48E2 performing spike-based computations. C) An equivalent circuit of the DSP48E2 when SIMD mode is enabled.

In traditional ANNs, one operation usually refers to one two-operand multiplication or two-operand addition. In SNNs, we define one synaptic operation as one 2:1 multiplexing or two-operand addition. We show that the dedicated DSP48E2 unit can provide up to 16 synaptic operations at high speed. This technique is described in detail below.

When the first stage multiplier in DSP48E2 is disabled, ALUMODE control bits are all cleared and carry inputs are ignored, the simplified DSP slice operation shown in Fig. 2B in the ALU stage can be expressed as:

$$\text{Post Adder Out} = W + X + Y + Z,$$

where $W$, $X$, $Y$ and $Z$ are the outputs of the four built-in 48-bit wide bus multiplexers. Moreover, the post-adder can be statically configured into SIMD mode, supporting a single 48-bit adder, dual independent 24-bit adders, or quad independent 12-bit adders.

The outputs of the four multiplexers are always added together by the post-adder. There are dozens of combinations of inputs to these multiplexers. One such combination is: either C or all 0s on the W multiplexer; either A:B or all 0s on the X multiplexer; all 0s on the Y multiplexer; either P, PCIN, or all 0s on the Z multiplexer. The 30-bit A and 18-bit B data inputs can optionally be registered once or twice to construct a pipeline stage, while the 48-bit C data input can optionally be registered once. The post-adder's output can be registered in the P register, and PCIN is the cascade input from a lower DSP slice. A nine-bit control input named OPMODE contains fields for the W, X, Y, and Z multiplexer selects and can be dynamically changed.

TABLE I Resource Utilization Comparison.

|        | DSP48E2 | LUT | FF  | CARRY8 |
|--------|---------|-----|-----|--------|
| DSP    | 1       | 0   | 0   | 0      |
| Fabric | 0       | 86  | 114 | 8      |

Utilizing the wide bus multiplexer, the cascade datapath, and the SIMD mode of the post-adder in DSP48E2, we can pack up to 16 sets of synaptic operations into a single DSP slice.

In this work, the synaptic connection weights are quantized into INT8 by the well-established post-training quantization or quantization-aware training methods developed in traditional neural networks (NNs).

Four sets of INT8 weights are resized to INT12 and concatenated into 48-bit. The upper 30 bits are assigned to the input port A while the lower 18 bits are assigned to input port B. A and B get concatenated and multiplexed by the X multiplexer. In NNs, the input activations are shared by different sets of weights to generate different channels. In this case, one spike is fetched to dynamically switch the X multiplexer between the four sets of weights (A:B) and all 0s, performing four 2:1 multiplex operations simultaneously.

Similarly, another four sets of INT8 weights are resized, concatenated, and directly assigned to the C data input. Another spike is fetched to dynamically switch the W multiplexer between C and all 0s, performing another four 2:1 multiplex operations.

The Z multiplexer selects the PCIN input, which carries the partial sum from the lower DSP slice. The Y multiplexer output is set to all 0s. The post-adder is set to SIMD mode and acts as four independent 12-bit adders that sum the outputs of the W, X, and Z multiplexers in each lane, which is equivalent to eight two-operand additions. Together with the eight 2:1 multiplex operations, a single DSP48E2 slice can therefore contribute 16 synaptic operations in total without general fabric logic overhead, as shown in Fig. 2C.
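The packing scheme described above can be checked with a bit-level behavioral model. The sketch below is not RTL and does not use any vendor API. All function names are hypothetical, and placing lane 0 in the least significant 12 bits is an assumption of this example.

```python
LANES, LANE_BITS = 4, 12
LANE_MASK = (1 << LANE_BITS) - 1


def pack_int8x4(weights):
    """Sign-extend four INT8 values to 12 bits and concatenate into 48 bits."""
    word = 0
    for lane, w in enumerate(weights):
        assert -128 <= w <= 127
        word |= (w & LANE_MASK) << (lane * LANE_BITS)
    return word


def split_ab(word48):
    """Upper 30 bits go to port A, lower 18 bits to port B."""
    return word48 >> 18, word48 & ((1 << 18) - 1)


def simd_add4(*operands):
    """Four independent 12-bit adds; carries do not cross lane boundaries."""
    out = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        lane_sum = sum((op >> shift) & LANE_MASK for op in operands)
        out |= (lane_sum & LANE_MASK) << shift
    return out


def unpack_int12x4(word48):
    """Read the four lanes back as signed 12-bit integers."""
    vals = []
    for lane in range(LANES):
        v = (word48 >> (lane * LANE_BITS)) & LANE_MASK
        vals.append(v - (1 << LANE_BITS) if v >= 1 << (LANE_BITS - 1) else v)
    return vals


def dsp_slice(spike_x, spike_w, ab_weights, c_weights, pcin=0):
    """P = W + X + Y + Z with X in {A:B, 0}, W in {C, 0}, Y = 0, Z = PCIN."""
    a, b = split_ab(pack_int8x4(ab_weights))
    x = ((a << 18) | b) if spike_x else 0          # X mux: A:B or all 0s
    w = pack_int8x4(c_weights) if spike_w else 0   # W mux: C or all 0s
    return simd_add4(w, x, 0, pcin)


p = dsp_slice(1, 1, [1, -2, 3, -4], [10, 20, -30, 40])
print(unpack_int12x4(p))  # [11, 18, -27, 36]
```

Each call performs eight spike-gated selections and eight lane additions, matching the count of 16 synaptic operations per slice. Feeding the result of one call into `pcin` of the next models the cascade path.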

Direct access to the specific features of the DSP48E2 is achieved by directly instantiating the DSP48E2 primitive. As listed in Table I, a straightforward fabric implementation of the synaptic operations described above consumes 86 look-up tables, 114 flip-flops, and 8 carry chains. Though it might not seem expensive on a small scale, it is considerably less efficient than the proposed approach and will lead to a compromised clock frequency.

#### C. Systolic Array for Synaptic Operations

The systolic array is a specialized mesh of homogeneous PEs designed to process massive parallel computations. It has the potential to run at a high frequency due to its regular and adjacent interconnections. However, designing systolic arrays is not trivial. Previous neuromorphic hardware adopting a systolic array architecture failed to achieve satisfactory performance, either in resource efficiency or clock frequency. Most systolic arrays targeting FPGA devices are implemented in low-speed general fabrics. In this paper, we design a high-performance systolic array featured by the DSP48E2 for SNNs.

A more straightforward representation of the aforementioned synaptic operations featured by a single DSP48E2 slice can be expressed as follows:

$$p_i = s_i \cdot W_i + p_{i-1}, p_{-1} = 0.$$

where $s_i$ is the $1 \times 2$ binary spike vector, $W_i$ is the $2 \times 4$ INT8 synaptic weight matrix, $p_i$ is the $1 \times 4$ partial sum vector, and $p_{i-1}$ is the partial sum vector contributed by the lower DSP slice, with the same shape as $p_i$. The symbol $\cdot$ represents the spike-weight vector-matrix multiplication.

The 12-bit representation of each channel in  $p_i$  allows up to eight DSP48E2 slices to cascade in a row without possible numeric overflow. In this way, the extended synaptic operations featured by a cascaded DSP48E2 chain can be expressed as follows:

$$p = \sum_{i=0}^7 s_i \cdot W_i = s \cdot W$$

where $s$ is the $1 \times 16$ binary spike vector, $W$ is the $16 \times 4$ 8-bit-integer (INT8) synaptic weight matrix, and $p$ is the $1 \times 4$ partial sum vector.
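The eight-slice limit can be checked with a short bound. The check assumes that each weight $w_k$ is a signed INT8 value in $[-128, 127]$, that each spike $s_k$ is 0 or 1, and that each 12-bit lane holds a two's-complement value. It is a sanity check under these assumptions rather than a statement taken from the device documentation. Eight slices contribute at most 16 weights to each lane, so

$$-2^{11} = 16 \times (-128) \le \sum_{k=0}^{15} s_k w_k \le 16 \times 127 = 2032 \le 2^{11} - 1.$$

The sum therefore stays within the signed 12-bit range $[-2048, 2047]$. A ninth slice would allow $18 \times (-128) = -2304$, which no longer fits in 12 bits.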

The cascaded DSP48E2 chain is the basic processing element (PE) in our systolic array design. A PE consists of eight cascaded DSP48E2 slices. An $M \times N$ systolic array consists of $\frac{M}{4}$ columns of PEs, with each column consisting of $\frac{N}{16}$ PEs and an adder tree. Each column computes $\frac{N}{16}$ multiplications of a $1 \times 16$ binary spike vector by a $16 \times 4$ weight matrix, while the adder tree sums up the results from the $\frac{N}{16}$ PEs, generating four output channels. With $\frac{M}{4}$ columns, the systolic array generates $M$ output channels in total.

Each PE in the systolic array contains different sets of synaptic weights. Adopting a weight-stationary scheme, synaptic weights remain cached in a PE until they are no longer needed. The same  $1 \times N$  binary spike vector is shared across columns horizontally, and M partial sums flow out of the systolic array vertically.
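The organization of PEs, columns, and adder trees can be summarized by a functional model that ignores timing and pipelining. The names `pe` and `systolic_array` are hypothetical, and storing the weights as an $N \times M$ nested list (rows indexed by input spike, columns by output channel) is an assumption of this sketch.

```python
from typing import List


def pe(spikes16: List[int], w16x4: List[List[int]]) -> List[int]:
    """One PE: eight cascaded slices, each adding two spike-gated weight rows."""
    p = [0, 0, 0, 0]                       # p_{-1} = 0
    for i in range(8):                     # slice i handles rows 2i and 2i+1
        for r in (2 * i, 2 * i + 1):
            if spikes16[r]:
                p = [acc + w for acc, w in zip(p, w16x4[r])]
    return p


def systolic_array(spikes: List[int], weights: List[List[int]]) -> List[int]:
    """M partial sums from a 1 x N spike vector and an N x M weight matrix."""
    n, m = len(weights), len(weights[0])
    assert n % 16 == 0 and m % 4 == 0 and len(spikes) == n
    out = []
    for col in range(m // 4):              # each column yields four channels
        col_sum = [0, 0, 0, 0]
        for k in range(n // 16):           # N/16 PEs feed the column adder tree
            rows = range(16 * k, 16 * (k + 1))
            tile = [weights[r][4 * col:4 * col + 4] for r in rows]
            part = pe(spikes[16 * k:16 * (k + 1)], tile)
            col_sum = [a + b for a, b in zip(col_sum, part)]
        out.extend(col_sum)
    return out
```

The same spike vector is sliced identically for every column, which mirrors the horizontal sharing of spikes, while each column owns a distinct tile of the weight matrix.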

#### D. Spike Vector Generation for Convolution by Line Buffer

Similar to ANN, 2-D convolution is the basic operation in a digitalized SCNN. We incorporate the traditional line buffer design [38] to generate the spike window needed for the spike-map convolution. The line buffer is commonly seen in CNN accelerator design because it can efficiently achieve kernel-level parallelism and ensure good reuse of image data.

When FireFly is configured to SCNN mode, $C_{in}$ channels of binary spike maps are bundled together and streamed into the line buffer. The $K_h \times K_w$ spike-bundle window is then flattened to a $K_h \times K_w \times C_{in}$ vector and sent to the systolic array. In most of the established CNN architectures, $3 \times 3$ convolution with stride 1 and same padding is the most common configuration. The SCNN architecture follows this scheme. Ideally, general neuromorphic hardware for SNNs should support all types of convolutional layers with different configurations. However, such hardware would not work efficiently for every convolution configuration, and the added generality would cause hardware overhead, so it might not be feasible. Therefore, we design specialized line buffer logic for $3 \times 3$ convolution. Nevertheless, the methods discussed here are compatible with other kernel sizes. Using the Dynamic Function Exchange feature of the FPGA, hardware supporting different types of convolutional layers can be dynamically deployed during runtime.

When FireFly is configured for multi-layer perceptron (MLP) topology mode, the line buffer datapath for SCNN is left idle and the shift register datapath for MLP is switched on. The shift register forms a serial-to-parallel stream width adapter by combining the $C_{in}$ input spikes of $K_h \times K_w$ input transactions into one. The length of the binary spike vector in the SCNN and MLP datapaths is the same, compatible with the height of the systolic array.
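The window generation performed by a $3 \times 3$ line buffer with stride 1 and same padding can be sketched in software as follows. This is an illustrative model for a single channel, not the FireFly RTL. The function names are hypothetical, and injecting the zero border into the pixel stream is just one possible way to realize same padding.

```python
from collections import deque
from typing import Iterator, List


def padded_stream(frame: List[List[int]]) -> Iterator[int]:
    """Raster-order pixel stream with a one-pixel zero border (same padding)."""
    w = len(frame[0])
    yield from [0] * (w + 2)
    for row in frame:
        yield 0
        yield from row
        yield 0
    yield from [0] * (w + 2)


def line_buffer_3x3(frame: List[List[int]]) -> Iterator[List[int]]:
    """Yield one flattened 3x3 window per input pixel, in raster order."""
    width = len(frame[0]) + 2
    # Two full rows plus three pixels is all the history a 3x3 window needs.
    history = deque(maxlen=2 * width + 3)
    for idx, pixel in enumerate(padded_stream(frame)):
        history.append(pixel)
        r, c = divmod(idx, width)
        if r >= 2 and c >= 2:   # window lies fully inside the padded map
            h = list(history)
            yield h[0:3] + h[width:width + 3] + h[2 * width:2 * width + 3]


for window in line_buffer_3x3([[1, 0], [0, 1]]):
    print(window)
```

Each input pixel is read from the stream once and reused by up to nine windows, which is the data reuse property that makes the line buffer attractive.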

#### E. Synaptic Weight Delivery in a Multi-level Hierarchy

An $M \times N$ systolic array configured in weight stationary mode needs $M \times N$ sets of weights. Switching the current set of stationary synaptic weights with the next set of weights can be problematic. The instantaneous switching bandwidth is extremely high, but switching occurs only when the weights expire.

The main idea of our solution is that the instantaneous bandwidth needed when switching to the next set of weights needs to be amortized over an idle period when the weights are kept stationary.

As shown in Fig. 3D, we propose a 4-level synaptic weight memory hierarchy to enable on-the-fly delivery of weights with minimum resource consumption. First, the synaptic weight stream coming from the AXI DataMover is adapted by the Lv1 stream width adapter. The adapted weight stream flows into the Lv2 Partial Reuse FIFO and is reused $T$ times. The weight stream from the Partial Reuse FIFO passes through the Lv3 width adapter and then gets cached in the Lv4 skid buffer. The systolic array holds the current set of weights stationary by applying back pressure to the skid buffer and releasing the pressure when the current set of weights is no longer needed.

A stream width adapter converts an $N$-bit input stream to an $N \times M$-bit output stream by accumulating $M$ elements of the input stream and firing them all at once. A skid buffer is the smallest pipeline FIFO buffer. It decouples the two sides of a ready/valid handshake to allow back-to-back transfers without a combinational path between input and output, thus pipelining the path.
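The behavior of the stream width adapter can be modeled with a short generator. The name `width_adapter` is hypothetical, and handshaking and back pressure are deliberately left out of this sketch.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def width_adapter(stream: Iterable[T], m: int) -> Iterator[List[T]]:
    """Collect m narrow beats and emit them as one wide beat."""
    beat: List[T] = []
    for item in stream:
        beat.append(item)
        if len(beat) == m:
            yield beat
            beat = []
    # A trailing partial beat is dropped here; hardware would keep waiting.


print(list(width_adapter(range(7), 3)))  # [[0, 1, 2], [3, 4, 5]]
```

The output fires once every `m` input beats, so the wide side of the adapter needs only `1/m` of the transfer rate of the narrow side.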

The Partial Reuse FIFO is the key component in this 4-level synaptic weight delivery hierarchy.

Most designs utilize the dual-port RAM to build a ping-pong buffer (shown in Fig. 3A) or a FIFO, to hide the latency of the data transfer process. However, the traditional ping-pong buffer mechanism can be problematic and the FIFO mechanism does not support data reuse.

The switching of the ping-pong buffer may complicate the controller design. Ping-pong buffers are costly and inefficient. The depth of the buffer must be large enough to support the most storage-expensive cases, not to mention the buffer size has to be doubled for ping-pong operation. However, the worst-case scenario will not occur in most cases. Only a small portion of the ping-pong buffer is occupied most of the time.



Fig. 3. Different Approaches for Hiding Data Transfer Latency to Improve Throughput. A) Ping-pong buffer. B) Synchronous FIFO. C) The Proposed Partial Reuse FIFO. D) A four-level synaptic weight delivery hierarchy to enable synaptic weight reuse, reduce off-chip memory bandwidth and hide the weight loading latency to the systolic array.

While the aforementioned problems are negligible in ANN accelerator design, we cannot afford to "double the size" in SNN neuromorphic hardware design because the memory bandwidth needed has already increased multiple times.

Ideally, the on-chip buffer that stores the synaptic weights in SNN should have the following properties:

- We do not need to double the buffer size and split the buffer into two regions for ping-pong operation just to guarantee no read-write collision will happen. No manual switching of the split buffers is needed.
- In SNN, the same synaptic weights need to be accessed at every timestep. We expect the data in the buffer can be read several times before they expire and are replaced by new data.
- The depth of the buffer is set to support the most storage-expensive cases, but multiple batches of data can be preloaded into the available large RAM spaces when the storage requirements are less expensive.

We propose the Partial Reuse FIFO to address the above requirements, enabling data reuse and efficient use of the buffer space without complex control logic.

As shown in Fig. 3B, a traditional synchronous FIFO can be described using a ring. The circumference of the ring represents the depth of the FIFO. The width of the ring represents the data width of the FIFO. A push pointer marks the write address of the incoming data. A pop pointer marks the read address of the output data. When the push pointer and the pop pointer point to the same address, the FIFO is either full or empty, depending on whether the occupancy of the FIFO is rising or falling. When the FIFO is full, the ready signal of the input AXI-Stream is deasserted. When the FIFO is empty, the valid signal of the output AXI-Stream is deasserted.

As shown in Fig.3C, the mechanism of the Partial Reuse FIFO is the same as the traditional synchronous FIFO, except that a partial region in the FIFO ring cannot be flushed by incoming data until it is reused T times, where T is a control register of the Partial Reuse FIFO.

The reuse region of the FIFO is labeled by Start and End. The pop pointer jumps back to the Start position whenever it reaches End. The reuse counter increases whenever the pop pointer jumps back to Start. The Start label stays the same while the region is still being reused. When the counter reaches T, the counter is reset, label End becomes the next label Start and the next label End is set to Start+L-1, where L is another control register of the Partial Reuse FIFO. Unlike the traditional synchronous FIFO, when the push pointer meets the label Start, the Partial Reuse FIFO is full and the ready signal of the input AXI-Stream is deasserted. When label End is ahead of the push pointer, the Partial Reuse FIFO is considered empty until the reuse sector of the FIFO is filled by the input stream.

The partial reuse FIFO satisfies the aforementioned properties. Using the valid-ready handshake protocol of the AXI-Stream, the function of the partial reuse FIFO is self-contained, with only two control registers exposed. The partial reuse FIFO contains only a monolithic RAM and does not need to be split. The push-pop pointer in the FIFO control logic ensures no read-write collision. The reuse sector protected by the Start-End label enables data reuse. New data from multiple batches can be pushed to the partial reuse FIFO sequentially as long as the FIFO is not full.
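The pointer behavior described above can be sketched as a small behavioral model. This is an illustrative software model, not the authors' SpinalHDL implementation: the class and method names are hypothetical, pointers are kept as ever-increasing counters to sidestep the full/empty ambiguity, and the model assumes that once a region expires the next region begins at the word immediately after the old End label.

```python
class PartialReuseFifo:
    """Behavioral model: a FIFO whose head region is replayed T times."""

    def __init__(self, depth, reuse_len, reuse_times):
        assert 0 < reuse_len <= depth and reuse_times > 0
        self.mem = [None] * depth
        self.depth = depth
        self.L = reuse_len    # control register L: length of the reuse region
        self.T = reuse_times  # control register T: number of passes per region
        self.pushed = 0       # total words written; push pointer = pushed % depth
        self.start = 0        # absolute position of the Start label
        self.offset = 0       # pop pointer position relative to Start
        self.passes = 0       # reuse counter for the current region

    def ready(self):
        # Full when the push pointer has wrapped around onto Start.
        return self.pushed - self.start < self.depth

    def valid(self):
        # Treated as empty until the whole region [Start, End] is filled.
        return self.pushed - self.start >= self.L

    def push(self, word):
        if not self.ready():
            return False
        self.mem[self.pushed % self.depth] = word
        self.pushed += 1
        return True

    def pop(self):
        if not self.valid():
            return None
        word = self.mem[(self.start + self.offset) % self.depth]
        self.offset += 1
        if self.offset == self.L:        # pop pointer reached End
            self.offset = 0              # jump back to Start
            self.passes += 1
            if self.passes == self.T:    # region expired after T passes
                self.passes = 0
                self.start += self.L     # assumption: next region follows End
        return word


fifo = PartialReuseFifo(depth=8, reuse_len=2, reuse_times=3)
for w in range(4):
    fifo.push(w)
replayed = [fifo.pop() for _ in range(8)]
print(replayed)  # [0, 1, 0, 1, 0, 1, 2, 3]
```

Only `L` and `T` need to be configured from outside; everything else follows from the ready/valid handshake, which mirrors the self-contained behavior claimed in the text.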

#### F. Psum-Vmem Unified Buffer and Spike Generation Logic

A classic systolic array consumes data from the input and weight data domains and feeds data to the output data domain. If one data domain stays stationary, the other two must flow through the computing logic. This rule holds for the three classic input, weight and output stationary dataflows.

Our architecture adopts the weight stationary dataflow. In this case, synaptic weights remain stationary in the systolic array, and the input binary spikes and the output flow in and out of the systolic array. The flowing spike vector is generated by the line buffer mechanism, and the outputs are stored in the proposed Psum-Vmem Unified Buffer.

In our architecture, the synaptic operations in SNN are spatially parallelized. However, it is unlikely to flatten a whole layer spatially onto the area-power-restricted hardware substrates. Therefore, certain tiling strategies need to be implemented. We adopt the channel tiling strategy to accommodate layers with a large number of channels to the same systolic array. Input spike map channels are split into multiple tiles to fit into the height of the systolic array. Output spike map channels are calculated N at a time according to the width of the systolic array.



Fig. 4. Psum-Vmem Update Mechanism. A) The finite-state-machine performing the Psum-Vmem update. B) The proposed Psum-Vmem unified buffer and Psum-Vmem update engine. C) The hardware implementation details of the Psum-Vmem update engine.

In each single timestep, the partial sums of the N output spike map channels are stored on-chip and are not fully accumulated until all tiles of the input spike map channels are calculated. In each layer, the membrane voltages of the N output spike map channels also need to be stored on-chip until all timesteps are iterated. Instead of instantiating separate buffers for the partial sum and the membrane voltage, we propose the Psum-Vmem Unified Buffer to reduce RAM consumption.

Since tiles of input spike map channels in a single timestep are sent to the computing array one by one and the temporal dimension of the SNN is executed sequentially in its natural order, the partial sum accumulating process and the membrane voltage update process can be scheduled using a finite state machine (FSM). The FSM has three states: the accumulating phase, the thresholding phase, and the clearing phase.

During the accumulating phase, the Psum extracted from the Psum-Vmem unified buffer is accumulated with the computing results from the systolic array. When the last tile of the input spike map channels in the current timestep arrives and the current timestep is not the last, the FSM switches to the thresholding phase. The extracted Psum is first accumulated, then processed by the optional leaky unit and the thresholding unit, and eventually written back to the unified buffer. The optional leaky unit subtracts a fixed portion of the accumulated Vmem from its value to support the LIF neuron dynamics. The thresholding unit compares the Vmem with the threshold, and if the Vmem exceeds the threshold, it generates a spike and resets the Vmem. All of the computations are pipelined to improve timing. The FSM switches back to the accumulating phase when this phase finishes. When the last tile of the input spike map channels in the last timestep arrives, the FSM switches to the clearing phase. The computation process is the same as in the thresholding phase, except that the Vmem value is cleared to reset the unified buffer for the next SNN layer.
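The three-phase schedule can be traced with a short behavioral sketch for a single buffer entry. It is a software illustration rather than the pipelined hardware: the function name `run_neuron`, the shift-based leak (`vmem >> leak_shift`), and the reset-to-zero behavior are assumptions, since the text does not fix the leak factor or the reset value.

```python
def run_neuron(psums, threshold, leak_shift=None):
    """Trace one Psum-Vmem buffer entry through all timesteps.

    psums[t][k] is the systolic array result for timestep t and input tile k.
    A single variable holds both the partial sum and the membrane voltage,
    which is the idea behind the unified buffer.
    """
    vmem = 0
    spikes = []
    last_step = len(psums) - 1
    for t, tiles in enumerate(psums):
        for k, psum in enumerate(tiles):
            if k < len(tiles) - 1:
                state = "ACCUMULATE"
            elif t < last_step:
                state = "THRESHOLD"
            else:
                state = "CLEAR"

            vmem += psum                      # every phase accumulates first
            if state == "ACCUMULATE":
                continue
            if leak_shift is not None:        # optional leaky unit (assumed form)
                vmem -= vmem >> leak_shift
            spike = int(vmem > threshold)
            if spike:
                vmem = 0                      # assumed reset-to-zero on firing
            spikes.append(spike)
            if state == "CLEAR":
                vmem = 0                      # buffer is reset for the next layer
    return spikes, vmem


print(run_neuron([[3, 4], [2, 1], [5, 6]], threshold=6))  # ([1, 0, 1], 0)
```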

#### V. IMPLEMENTATION AND EXPERIMENTS

#### A. Experimental Setup

Most neuromorphic hardware uses expensive large FPGA devices, ignoring the feasibility of deploying such hardware in the real world. FireFly is mapped onto several off-the-shelf, commercially available Xilinx Zynq Ultrascale FPGAs, including the Ultra96v2, KV260 and ZCU104 FPGA evaluation boards, bringing real-world SNN applications in edge scenarios within reach. The FPGA chips of the three evaluation boards are xczu3eg, xczu5ev, and xczu7ev, respectively.

Our proposed FireFly is designed using SpinalHDL, a hardware description language that supports object-oriented and functional programming. Compared with an HLS-based code template, parameterized Verilog, or SystemVerilog, SpinalHDL can offer a higher level of abstraction and reconfigurability. The Verilog code generated by the SpinalHDL compiler is synthesized and implemented in Xilinx Vivado 2021.1 with ML-based design optimization to achieve a higher clock rate and faster timing closure. Power consumption estimates and timing results are obtained after place-and-route using the power analysis and timing summary tools in the Vivado Design Suite. Throughput performance is obtained by recording the timer value on the PS side of the Zynq while the PL runs the benchmark tasks.

FireFly is based on the Brain-inspired Cognitive Engine (BrainCog) and is a first step towards the software-hardware co-design for the BrainCog project [41] [42].

#### B. Bridging the Gap between Peak and Avg. GSOP/s

The theoretical peak GSOP/s of an SNN accelerator is given as:

$$\text{Peak GSOP/s} = 2 \times f \times M \times N, \qquad (5)$$

where f is the system clock frequency, and  $M \times N$  denotes the size of the systolic array. The peak GSOP/s calculation is the same as [20] and [24]. In FireFly, M denotes the number of columns in the systolic array, while N denotes the rows. The peak performance should be proportional to the systolic array size. However, the actual throughput, or average GSOP/s, can be degraded due to insufficient bandwidth and inefficient controller design.
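As a quick check of Eq. (5), the peak figures for the two array sizes evaluated later can be reproduced in a few lines. The helper name `peak_gsops` is hypothetical; the 300 MHz clock and the array sizes are taken from the text and Table II.

```python
def peak_gsops(freq_mhz, m, n):
    """Peak GSOP/s = 2 * f * M * N, with f given in MHz."""
    return 2 * freq_mhz * m * n / 1000  # MHz * ops -> GSOP/s


for m, n in [(16, 144), (32, 288)]:
    print(f"{m} x {n} @ 300 MHz: {peak_gsops(300, m, n):.1f} GSOP/s")
# 16 x 144 @ 300 MHz: 1382.4 GSOP/s
# 32 x 288 @ 300 MHz: 5529.6 GSOP/s
```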

In our design, the line buffer mechanism enables binary spike map reuse, the partial reuse FIFO enables synaptic weight reuse, and the Psum-Vmem buffer is used to avoid back-and-forth fetch and store. The memory bandwidth needed for off-chip data transfer is minimized, and thus not a bottleneck of the system's average performance.

TABLE II COMPARISON WITH OTHER WORKS IN RESOURCE UTILIZATION.

| Work | Device | LUTs Used | LUTs Util. | Registers Used | Registers Util. | BRAM/URAM Used | BRAM/URAM Util. | DSP48 Used | DSP48 Util. | Frequency (MHz) | Peak GSOP/s |
|------|--------|-----------|------------|----------------|-----------------|----------------|-----------------|------------|-------------|-----------------|-------------|
| [39]              | xc7vx690t  | 53k  | 12.20%          | 100k | 11.50%      | 65    | 4.40%       | 0    | 0%          | 100         | /        |
| [22]              | xc7k325t   | 170k | 83.70%          | 113k | 27.70%      | 254   | 57.10%      | 0    | 0%          | 135         | 3.2      |
| [24]              | xcvu440    | 302k | 11.90%          | 421k | 8.30%       | 192   | 7.60%       | 0    | 0%          | 200         | 1562.5   |
| [40]              | xcku115    | 585k | 88.20%          | 232k | 17.40%      | 432   | 20%         | 0    | 0%          | 140         | 253      |
| [25]              | 28nm ASIC  | /    | /               | /    | /           | /     | /           | /    | /           | 200         | 684.5    |
| [31]              | 28nm ASIC  | /    | /               | /    | /           | /     | /           | /    | /           | 200         | 3970.1   |
| ours <sup>1</sup> | xczu3eg    | 15k  | 21.40%          | 53k  | 37.50%      | 162   | 75%         | 288  | 80%         | 300         | 1382.4   |
| ours <sup>2</sup> | xczu7ev    | 42k  | 18.20%          | 196k | 42.60%      | 25/40 | 11.5/41.6%  | 1152 | 66.60%      | 300         | 5529.6   |
| ours <sup>3</sup> | xczu5ev    | 32k  | 27.35%          | 112k | 47.86%      | 16/24 | 11.1%/37.5% | 576  | 46.20%      | 300         | 1382.4×2 |

<sup>1</sup> FireFly with a  $16 \times 144$  systolic array implemented on Ultra96v2.

<sup>2</sup> FireFly with a $32 \times 288$ systolic array implemented on ZCU104.

<sup>3</sup> FireFly with two  $16 \times 144$  systolic arrays implemented on KV260.

We argue that the communication between the controller and the accelerator significantly impacts the system's actual throughput. Note that we choose Zynq devices as the system platforms. The built-in host CPU controller enables fast deployment of different SNN networks without the need to change the PL logic. In most Zynq-based SNN accelerators such as Cerebron [20], the host program in the Zynq processing system (PS) sends synaptic weights and binary input spike maps into the Zynq programmable logic (PL) and collects the output spike maps in different SNN layers. However, the control command sequence traveling between PS and PL through the low-performance AXI-Lite protocol induces non-negligible latency, leaving the systolic array idle and reducing the average throughput. In FireFly, the host program generates a command sequence in advance and sends the commands to PL through a high-performance AXI-Stream to the internal command queue of the AXI DataMover. In this way, the request-acknowledge waiting cycles between commands are eliminated, which further improves the average throughput.

#### C. Performance Analysis

The size of the systolic array can be statically reconfigured in FireFly according to the on-chip resources of different evaluation boards. An $M \times N$ systolic array in FireFly receives N presynaptic inputs and produces partial sums for M neurons, where M = P and $N = K_h \times K_w \times P$. The resource consumption, memory bandwidth and acceleration performance are linearly proportional to the parallelism factor P. P can be any value as long as the systolic array fits in the target device. As P is also the tiling factor of the input and output channels in a convolutional layer, it is preferable to set P to a power of two because the number of channels in most convolutional layers is a power of two. Therefore, we evaluate two representative configurations, $16 \times 144$ and $32 \times 288$, to demonstrate the reconfigurability of FireFly.
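The relation between P and the array dimensions can be made concrete with a small sketch. The helper names are hypothetical, and the default $3 \times 3$ kernel is an assumption inferred from the arithmetic ($144 = 3 \times 3 \times 16$); the tile counts simply apply P as the channel tiling factor.

```python
import math


def array_shape(p, k_h=3, k_w=3):
    """Return (M, N) with M = P and N = K_h * K_w * P."""
    return p, k_h * k_w * p


def channel_tiles(c_in, c_out, p):
    """Input-channel tiles and output-channel passes for one conv layer."""
    return math.ceil(c_in / p), math.ceil(c_out / p)


print(array_shape(16))                           # (16, 144)
print(array_shape(32))                           # (32, 288)
print(channel_tiles(c_in=64, c_out=128, p=16))   # (4, 8)
```

When the channel counts are multiples of P, as in the example, every tile fills the array completely, which is the reason the text prefers a power-of-two P.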

Using DSP48 slices to implement synaptic operations significantly reduces the fabric overhead and yields substantial GSOP/s improvements over most existing hardware. FireFly with a $16 \times 144$ systolic array can achieve a peak performance of 1382.4GSOP/s, and FireFly with a $32 \times 288$ systolic array can achieve a peak performance of 5529.6GSOP/s, as shown in Table II.

To the best of our knowledge, SIES [24] achieves the highest GSOP/s among all the existing FPGA-based accelerators. Compared with SIES [24], FireFly mapped on xczu3eg consumes only $\frac{1}{20}$ of the LUTs and $\frac{1}{8}$ of the FFs but still achieves similar GSOP/s, whereas FireFly mapped on xczu7ev consumes only $\frac{1}{7}$ of the LUTs and $\frac{1}{2}$ of the FFs and achieves a ×3.5 speed up. Additionally, we map two heterogeneous FireFly cores onto xczu5ev to support the concurrent inference of two independent SNNs.
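The quoted ratios follow directly from the rounded resource counts in Table II. The sketch below only redoes that arithmetic; the dictionary layout and key names are chosen here for illustration.

```python
# Values copied from Table II (LUTs and registers in thousands).
sies = {"luts": 302, "ffs": 421, "gsops": 1562.5}
firefly = {
    "xczu3eg": {"luts": 15, "ffs": 53, "gsops": 1382.4},
    "xczu7ev": {"luts": 42, "ffs": 196, "gsops": 5529.6},
}

for device, res in firefly.items():
    lut_ratio = sies["luts"] / res["luts"]    # how many times fewer LUTs
    ff_ratio = sies["ffs"] / res["ffs"]       # how many times fewer FFs
    speedup = res["gsops"] / sies["gsops"]    # peak throughput ratio
    print(f"{device}: 1/{lut_ratio:.0f} LUTs, 1/{ff_ratio:.0f} FFs, x{speedup:.2f} peak GSOP/s")
```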

FireFly can still achieve higher throughput when compared with SpinalFlow and SATO, which are state-of-the-art SNN hardware accelerators built in 28nm ASIC technology. We are well aware that it is difficult to make an apples-to-apples comparison among hardware that adopts different design methodologies, supports different types of neurons, uses different synaptic weight precisions, or is implemented on different platforms. Nevertheless, FireFly can still be called a high-performance SNN accelerator given its GSOP/s performance.

#### D. Benchmark Evaluations

We deploy several state-of-the-art SNN networks trained by backpropagation algorithms [4] on FireFly to test the inference performance. We evaluate not only the static datasets such as MNIST, CIFAR10 and CIFAR100 but also the neuromorphic datasets such as DVS-CIFAR10 and DVS-Gesture.

The models are trained using surrogate functions like quadratic gate and arctangent gradient. Direct coding and backpropagation through time algorithm significantly reduce the total timesteps of the SNNs. In our experiment, the timesteps are scaled down to four without a significant accuracy drop. These training algorithms are provided in BrainCog's infrastructures [41] [42].

To deploy a PyTorch-trained SNN model to FireFly, we first apply batchnorm fusion to merge each batch normalization layer with the preceding convolutional layer. We then adopt post-training quantization to convert the Float32 synaptic weights to INT8 and the Float32 threshold to INT18. Note that the performance drop of post-training quantization without further retraining or fine-tuning is negligible in SNNs because no scaling errors of multiplications are introduced.
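The two deployment steps can be sketched in plain Python. The batchnorm fusion uses the standard algebra for folding a batch normalization layer into the preceding convolution. The quantization part is an assumption: the text does not specify the scheme, so the sketch uses symmetric per-tensor scaling and divides the threshold by the same step as the weights. All function names are hypothetical.

```python
import math


def fuse_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the preceding conv; weights[c] is output channel c, flattened."""
    fused_w, fused_b = [], []
    for c, kernel in enumerate(weights):
        scale = gamma[c] / math.sqrt(var[c] + eps)
        fused_w.append([w * scale for w in kernel])
        fused_b.append((bias[c] - mean[c]) * scale + beta[c])
    return fused_w, fused_b


def quantize(weights, threshold, w_bits=8, th_bits=18):
    """Assumed scheme: symmetric per-tensor step shared by weights and threshold."""
    q_max = 2 ** (w_bits - 1) - 1
    step = max(abs(w) for kernel in weights for w in kernel) / q_max
    q_weights = [[round(w / step) for w in kernel] for kernel in weights]
    # With binary spikes the accumulated sum scales by 1/step, so dividing
    # the threshold by the same step keeps the firing comparison aligned.
    th_max = 2 ** (th_bits - 1) - 1
    q_threshold = max(-th_max - 1, min(th_max, round(threshold / step)))
    return q_weights, q_threshold, step


w, b = fuse_batchnorm(
    weights=[[0.5, -0.25], [2.0, 0.25]], bias=[0.0, 0.0],
    gamma=[1.0, 2.0], beta=[0.0, 0.5], mean=[0.0, 1.0], var=[1.0, 4.0], eps=0.0,
)
q_w, q_th, step = quantize(w, threshold=0.8)
print(b, q_w, q_th)
```

Because the inputs are binary spikes, the accumulation only adds quantized weights, which is one way to read the claim that no multiplication scaling errors are introduced.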

TABLE III

| Venue | Work | Network | Dataset | Latency | Accuracy | GSOP/s | Device | Frequency | Power |
|-------|------|---------|---------|---------|----------|--------|--------|-----------|-------|
| TVLSI'14  | [43] | 784-500-500-10                  | MNIST       | 9.25ms  | 94.2     | /                        | xc6slx150t   | 75MHz     | 1.5W   |
| ICCAD'20  | [27] | 28x28-32c3-p2-32c3-p2-256-10    | MNIST       | 7.53ms  | 99.42    | /                        | xczu9eg      | 125MHz    | 4.5W   |
| TCAD'22   | [30] | 28x28-16c-32c-8c-10             | MNIST       | 45us    | 98.5     | 22.6                     | xc7z045      | 200MHz    | 0.96W  |
| TCAS-I'21 | [44] | 784-200-100-10                  | MNIST       | 3.15ms  | 92.93    |                          | xc7vx485t    | 100MHz    | /      |
| JCST'20   | [24] | 28x28-12c5-p2-64c5-p2-10        | MNIST       | /       | 99.16    | 1562.5                   | xcvu440      | 200MHz    | /      |
| TCAD'21   | [22] | 32x32-32c3-p2-32c3-p2-256-10    | SVHN        | 1.21 ms | 82.15    | 3.2                      | xc7k325t     | 100MHz    | 0.699W |
|           |      | 784-512-256-128-64-10           | FMNIST      | 0.14 ms | 89.01    | 3.2                      |              | 200MHz    | 0.982W |
| TRETS'22  | [45] | 28x28-32c3-p2-32c3-p2-256-10    | MNIST       | 77us    | 99.17    | /                        | xczu9eg      | 200MHz    | 24.5W  |
|           |      | 32x32-(192c5-192c1-192c1-p3)*2-192c5-192c1-10c1-AP-10 | CIFAR10 | 6.8ms | 88.19 | /              |              |           |        |
| DATE'22   | [26] | 144x144-p4-32c-p2-32c-p2-512-512-11 | NMNIST  | 3.83ms  | 97.81    | 51.2                     | 22nm ASIC    | 400MHz    | 0.11W  |
|           |      |                                 | DVS-Gesture | 7.1ms   | 92.4     | 51.2                     |              |           |        |
|           | ours | SCNN-5 <sup>1</sup>             | MNIST       | 0.491ms | 98.12%   | **91%** <sup>5</sup>     | xczu3eg      | 300MHz    | 2.55W  |
|           |      | SCNN-7 <sup>2</sup>             | CIFAR10     | 1.035ms | 91.36%   | **89%** <sup>5</sup>     |              |           |        |
|           |      | SCNN-11 <sup>4</sup>            | CIFAR100    | 2.125ms | 64.28%   | **86%** <sup>5</sup>     |              |           |        |
|           |      | SCNN-9 <sup>3</sup>             | DVS-CIFAR10 | 3.541ms | 72.40%   | **87%** <sup>5</sup>     |              |           |        |
|           |      | SCNN-9 <sup>3</sup>             | DVS-Gesture | 3.541ms | 89.29%   | **87%** <sup>5</sup>     |              |           |        |

<sup>1</sup> SCNN-5: 28x28-16c3-64c3-p2-128c3-p2-256c3-256c3-10

<sup>2</sup> SCNN-7: 32x32-16c3-64c3-p2-128c3-128c3-p2-256c3-256c3-p2-512c3-10

<sup>3</sup> SCNN-9: 48x48-16c3-64c3-64c3-p2-128c3-128c3-p2-256c3-256c3-p2-512c3-512c3-10

<sup>4</sup> SCNN-11: 32x32-16c3-64c3-64c3-p2-128c3-128c3-128c3-p2-256c3-256c3-256c3-p2-512c3-512c3-100

<sup>5</sup> The GSOP/s utilization ratio: Actual measured GSOP/s divided by the peak GSOP/s. The peak GSOP/s is 1382.4 on xczu3eg.

FireFly shows reconfigurability on different SNN models for different image classification tasks. We evaluate four different SNN model structures with 5, 7, 9, and 11 convolutional layers on five different datasets, as shown in Table III. Note that our chosen device, xczu3eg, is an edge device with the fewest resources among all the listed hardware, yet FireFly still shows significant improvement in all these benchmarks. Compared with [27], FireFly achieves a $\times 15$ speed up and similar accuracy on the MNIST dataset. Compared with [21], FireFly achieves higher accuracy and a $\times 6$ inference speed up on the CIFAR10 dataset. Compared with the ASIC design [26], FireFly achieves a $\times 2$ speed up and similar accuracy on the DVS-Gesture dataset. Note that our SNN models are considerably bigger and deeper than the listed benchmarks.

When using the larger xczu7ev device, all the inference performances listed above are improved by $\times 4$ because xczu7ev supports higher parallelism and has a peak performance of 5.53TSOP/s. Our system also supports multiple heterogeneous cores running different SNN models concurrently. When targeting xczu5ev, two FireFly cores can be deployed independently to support multiple real-world tasks.

#### E. Discussion

We argue that for FPGA-based SNN accelerator design, the benefits of designing complicated hardware supporting spike sparsity may not make up for the losses of irregular interconnect and underutilization of the dedicated hard block.

The system clock frequency can have a significant impact on inference performance. Compared with ASICs, routing in FPGAs contributes more delay time since logic elements are connected through a series of switching matrices instead of direct physical wires. A complex digital design with irregular interconnect can easily violate the timing requirements even in the most state-of-the-art FPGA devices. Most existing FPGA-based SNN accelerators can only satisfy the timing requirement of at most 200MHz even on the expensive Virtex Ultrascale+ device.

An important aspect of FPGA low-power system design is to utilize the existing dedicated hard blocks rather than build equivalents from scratch. Implementing the same function using a dedicated hard block in FPGAs usually consumes less energy than using its general fabric counterpart. However, most existing FPGA-based SNN accelerators fail to delve into the features provided by the existing dedicated hard blocks and adopt a naive implementation of spike computation using low-speed fabric.

In this paper, FireFly provides a different perspective on designing dedicated neuromorphic hardware for spiking neural networks targeting FPGA devices. We are well aware that it is important to design hardware that supports sparsity acceleration. However, to the best of our knowledge, only a few studies [25] [31] targeting ASICs show significant speed-ups from this inherent property of SNNs, and the large majority of FPGA-based designs do not. Instead of designing complicated circuits to support sparsity acceleration, FireFly consists of a monolithic systolic array and adopts a straightforward weight stationary dataflow. The acceleration comes from the clock frequency improvement brought by the regular and simple interconnect of the systolic array, the pipelined arithmetic computations, and, most importantly, the flexible use of the multi-function DSP48E2s.

In fact, the potential of the DSP48E2 is still far from being fully realized. Wu et al. [11] proposed a high-throughput processing array for matrix multiplication based on DSP supertile and achieved peak DSP clock rates on Xilinx UltraScale (741 MHz) and UltraScale+ (891 MHz) devices. SNN accelerators can incorporate the DSP supertile design and achieve even higher performance.

The potential of other dedicated hard blocks on FPGA is also yet to be exploited. Scaling the Cascades [10] fully utilized the dedicated cascade interconnect of the DSP48E2, BRAM36K, and URAM288K and achieved nearly 100% usage of these hard blocks, delivering high inference speed on MLPerf benchmarks. It is necessary to migrate the existing hardware optimization techniques of ANN accelerator design to SNN neuromorphic hardware research.

Nevertheless, we agree that ideally, the main advantage of new SNN accelerators compared to ANNs on digital hardware comes primarily from exploiting the sparsity of spikes and not from the replacement of MAC operations with AC operations [46]. Future neuromorphic hardware design should exploit spike sparsity and migrate existing FPGA optimization techniques simultaneously.

#### VI. CONCLUSIONS

In this work, we introduced a high-throughput and reconfigurable hardware accelerator for spiking neural networks. To achieve high-performance inference of SNN, we fully exploited the features of the dedicated DSP48E2 embedded in the FPGA and achieved the highest GSOP/s compared with the existing accelerator designs. To improve memory efficiency, we designed a synaptic weight delivery hierarchy and a Psum-Vmem unified buffer to support the high parallelism. To demonstrate FireFly's reconfigurability, we evaluated multiple deep SNN models on various datasets. To make SNN applications more convenient, we used off-the-shelf commercially available FPGA edge devices, offering a more feasible solution than any other existing hardware. In the future, we will try to migrate more optimization techniques targeting FPGAs while exploring sparsity acceleration to enable more energy-efficient SNN software and hardware co-design.

#### REFERENCES

- [1] Wolfgang Maass, "Networks of spiking neurons: the third generation of neural network models," *Neural networks*, vol. 10, no. 9, pp. 1659–1671, 1997.
- [2] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Luping Shi, "Spatiotemporal backpropagation for training high-performance spiking neural networks," *Frontiers in neuroscience*, vol. 12, p. 331, 2018.
- [3] Wenrui Zhang and Peng Li, "Temporal spike sequence learning via backpropagation for deep spiking neural networks," Advances in Neural Information Processing Systems, vol. 33, pp. 12022–12033, 2020.
- [4] Guobin Shen, Dongcheng Zhao, and Yi Zeng, "Backpropagation with biologically plausible spatiotemporal adjustment for training deep spiking neural networks," vol. 3, no. 6, p. 100522. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666389922001192
- [5] Youngeun Kim and Priyadarshini Panda, "Revisiting batch normalization for training low-latency deep spiking neural networks from scratch," *Frontiers in neuroscience*, p. 1638, 2020.
- [6] Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li, "Going deeper with directly-trained larger spiking neural networks," in *Proceedings of the AAAI Conference on Artificial Intelligence*, vol. 35, no. 12, 2021, pp. 11062–11070.
- [7] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Yuan Xie, and Luping Shi, "Direct training for spiking neural networks: Faster, larger, better," in *Proceedings of the AAAI Conference on Artificial Intelligence*, vol. 33, no. 01, 2019, pp. 1311–1318.
- [8] Mingkun Xu, Yujie Wu, Lei Deng, Faqiang Liu, Guoqi Li, and Jing Pei, "Exploiting spiking dynamics with spatial-temporal feature normalization in graph learning," arXiv preprint arXiv:2107.06865, 2021.
- [9] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 1, pp. 127–138.

- [10] Ananda Samajdar, Tushar Garg, Tushar Krishna, and Nachiket Kapre, "Scaling the cascades: Interconnect-aware FPGA implementation of machine learning problems," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL), pp. 342–349, ISSN: 1946-1488.
- [11] Ephrem Wu, Xiaoqian Zhang, David Berman, and Inkeun Cho, "A high-throughput reconfigurable processing array for neural networks," in 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4, ISSN: 1946-1488.
- [12] Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, Yuyun Liao, Chit-Kwan Lin, Andrew Lines, Ruokun Liu, Deepak Mathaikutty, Steven McCoy, Arnab Paul, Jonathan Tse, Guruguhanathan Venkataramanan, Yi-Hsin Weng, Andreas Wild, Yoonseok Yang, and Hong Wang, "Loihi: A neuromorphic manycore processor with on-chip learning," *IEEE Micro*, vol. 38, no. 1, pp. 82–99.
- [13] Jing Pei, Lei Deng, Sen Song, Mingguo Zhao, Youhui Zhang, Shuang Wu, Guanrui Wang, Zhe Zou, Zhenzhi Wu, Wei He, Feng Chen, Ning Deng, Si Wu, Yu Wang, Yujie Wu, Zheyu Yang, Cheng Ma, Guoqi Li, Wentao Han, Huanglong Li, Huaqiang Wu, Rong Zhao, Yuan Xie, and Luping Shi, "Towards artificial general intelligence with hybrid tianjic chip architecture," *Nature*, vol. 572, no. 7767, pp. 106–111. [Online]. Available: https://www.nature.com/articles/s41586-019-1424-8
- [14] Eustace Painkras, Luis A. Plana, Jim Garside, Steve Temple, Francesco Galluppi, Cameron Patterson, David R. Lester, Andrew D. Brown, and Steve B. Furber, "SpiNNaker: A 1-w 18-core system-on-chip for massively-parallel neural network simulation," *IEEE Journal of Solid-State Circuits*, vol. 48, no. 8, pp. 1943–1953.
- [15] Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon Nam, Brian Taba, Michael Beakes, Bernard Brezzo, Jente B. Kuang, Rajit Manohar, William P. Risk, Bryan Jackson, and Dharmendra S. Modha, "TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 34, no. 10, pp. 1537–1557.
- [16] Johannes Schemmel, Daniel Brüderle, Andreas Grübl, Matthias Hock, Karlheinz Meier, and Sebastian Millner, "A wafer-scale neuromorphic hardware system for large-scale neural modeling," in 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1947–1950, ISSN: 2158-1525.
- [17] J. Feldmann, N. Youngblood, C. D. Wright, H. Bhaskaran, and W. H. P. Pernice, "All-optical spiking neurosynaptic networks with self-learning capabilities," *Nature*, vol. 569, no. 7755, pp. 208–214. [Online]. Available: https://www.nature.com/articles/s41586-019-1157-8
- [18] Tiankuang Zhou, Xing Lin, Jiamin Wu, Yitong Chen, Hao Xie, Yipeng Li, Jingtao Fan, Huaqiang Wu, Lu Fang, and Qionghai Dai, "Large-scale neuromorphic optoelectronic computing with a reconfigurable diffractive processing unit," *Nature Photonics*, vol. 15, no. 5, pp. 367–373. [Online]. Available: https://www.nature.com/articles/s41566-021-00796-w
- [19] Jia-Qin Yang, Ruopeng Wang, Zhan-Peng Wang, Qin-Yuan Ma, Jing-Yu Mao, Yi Ren, Xiaoyang Yang, Ye Zhou, and Su-Ting Han, "Leaky integrate-and-fire neurons based on perovskite memristor for spiking neural networks," vol. 74, p. 104828. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2211285520303852
- [20] Qinyu Chen, Chang Gao, and Yuxiang Fu, "Cerebron: A reconfigurable architecture for spatiotemporal sparse spiking neural networks," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 30, no. 10, pp. 1425–1437.
- [21] Sathish Panchapakesan, Zhenman Fang, and Jian Li, "SyncNN: Evaluating and accelerating spiking neural networks on FPGAs," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), pp. 286–293, ISSN: 1946-1488.
- [22] Wujian Ye, Yuehai Chen, and Yijun Liu, "The implementation and optimization of neuromorphic hardware for supporting spiking neural networks with MLP and CNN topologies," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, pp. 1–1.
- [23] Myat Thu Linn Aung, Chuping Qu, Liwei Yang, Tao Luo, Rick Siow Mong Goh, and Weng-Fai Wong, "DeepFire: Acceleration of convolutional spiking neural network on modern field programmable gate arrays," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), pp. 28–32, ISSN: 1946-1488.

- [24] Shu-Quan Wang, Lei Wang, Yu Deng, Zhi-Jie Yang, Sha-Sha Guo, Zi-Yang Kang, Yu-Feng Guo, and Wei-Xia Xu, "SIES: A novel implementation of spiking convolutional neural network inference engine on field-programmable gate array," vol. 35, no. 2, pp. 475–489. [Online]. Available: https://doi.org/10.1007/s11390-020-9686-z
- [25] Surya Narayanan, Karl Taht, Rajeev Balasubramonian, Edouard Giacomin, and Pierre-Emmanuel Gaillardon, "SpinalFlow: An architecture and dataflow tailored for spiking neural networks," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 349–362.
- [26] Alfio Di Mauro, Arpan Suravi Prasad, Zhikai Huang, Matteo Spallanzani, Francesco Conti, and Luca Benini, "SNE: an energy-proportional digital accelerator for sparse event-based convolutions," in 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 825–830, ISSN: 1558-1101.
- [27] Haowen Fang, Zaidao Mei, Amar Shrestha, Ziyi Zhao, Yilan Li, and Qinru Qiu, "Encoding, model, and architecture: Systematic optimization for spiking neural network in FPGAs," in 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pp. 1–9, ISSN: 1558-2434.
- [28] Jeong-Jun Lee and Peng Li, "Reconfigurable dataflow optimization for spatiotemporal spiking neural computation on systolic array accelerators," in 2020 IEEE 38th International Conference on Computer Design (ICCD), pp. 57–64, ISSN: 2576-6996.
- [29] Jeong-Jun Lee, Wenrui Zhang, and Peng Li, "Parallel time batching: Systolic-array acceleration of sparse spiking neural computation," in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 317–330, ISSN: 2378-203X.
- [30] Qinyu Chen, Chang Gao, Xinyuan Fang, and Haitao Luan, "Skydiver: A spiking neural network accelerator exploiting spatio-temporal workload balance," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, pp. 1–1.
- [31] Fangxin Liu, Wenbo Zhao, Zongwu Wang, Yongbiao Chen, Tao Yang, Zhezhi He, Xiaokang Yang, and Li Jiang, "SATO: spiking neural network acceleration via temporal-oriented dataflow and architecture," in *Proceedings of the 59th ACM/IEEE Design Automation Conference*, ser. DAC '22. Association for Computing Machinery, pp. 1105–1110. [Online]. Available: https://doi.org/10.1145/3489517.3530592
- [32] Eugene M Izhikevich, "Which model to use for cortical spiking neurons?" *IEEE Transactions on Neural Networks*, vol. 15, no. 5, pp. 1063–1070, 2004.
- [33] Alan L Hodgkin and Andrew F Huxley, "A quantitative description of membrane current and its application to conduction and excitation in nerve," *The Journal of physiology*, vol. 117, no. 4, p. 500, 1952.
- [34] Larry F Abbott, "Lapicque's introduction of the integrate-and-fire model neuron (1907)," *Brain research bulletin*, vol. 50, no. 5-6, pp. 303–304, 1999.
- [35] Peter Dayan, Laurence F Abbott *et al.*, "Theoretical neuroscience: computational and mathematical modeling of neural systems," *Journal of Cognitive Neuroscience*, vol. 15, no. 1, pp. 154–155, 2003.
- [36] Ling Zhang, Jing Yang, Cong Shi, Yingcheng Lin, Wei He, Xichuan Zhou, Xu Yang, Liyuan Liu, and Nanjian Wu, "A cost-efficient high-speed VLSI architecture for spiking convolutional neural network inference using time-step binary spike maps," vol. 21, no. 18, p. 6006, Multidisciplinary Digital Publishing Institute. [Online]. Available: https://www.mdpi.com/1424-8220/21/18/6006
- [37] Xilinx Inc., "Ultrascale architecture dsp slice user guide," 2021.
- [38] B. Bosi, G. Bois, and Y. Savaria, "Reconfigurable pipelined 2-d convolvers for fast digital signal processing," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 7, no. 3, pp. 299–308.
- [39] Shasha Guo, Lei Wang, Shuquan Wang, Yu Deng, Zhijie Yang, Shiming Li, Zhige Xie, and Qiang Dou, "A systolic SNN inference accelerator and its co-optimized software framework," in *Proceedings of the 2019 on Great Lakes Symposium on VLSI*, ser. GLSVLSI '19. Association for Computing Machinery, pp. 63–68. [Online]. Available: https://doi.org/10.1145/3299874.3317966
- [40] Yisong Kuang, Xiaoxin Cui, Zilin Wang, Chenglong Zou, Yi Zhong, Kefei Liu, Zhenhui Dai, Dunshan Yu, Yuan Wang, and Ru Huang, "ESSA: Design of a programmable efficient sparse spiking neural network accelerator," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, pp. 1–11.
- [41] Yi Zeng, Dongcheng Zhao, Feifei Zhao, Guobin Shen, Yiting Dong, Enmeng Lu, Qian Zhang, Yinqian Sun, Qian Liang, Yuxuan Zhao, Zhuoya Zhao, Hongjian Fang, Yuwei Wang, Yang Li, Xin Liu, Chengcheng Du, Qingqun Kong, Zizhe Ruan, and Weida Bi, "Braincog: A spiking neural network based brain-inspired cognitive intelligence engine for brain-inspired ai and brain simulation," 2022. [Online]. Available: https://arxiv.org/abs/2207.08533

- [42] "Braincog: Brain-inspired cognitive intelligence engine." [Online]. Available: http://www.brain-cog.network
- [43] Daniel Neil and Shih-Chii Liu, "Minitaur, an event-driven FPGA-based spiking network accelerator," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 22, no. 12, pp. 2621–2628.
- [44] Sixu Li, Zhaomin Zhang, Ruixin Mao, Jianbiao Xiao, Liang Chang, and Jun Zhou, "A fast and energy-efficient SNN processor with adaptive clock/event-driven computation scheme and online learning," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 68, no. 4, pp. 1543–1552.
- [45] Sathish Panchapakesan, Zhenman Fang, and Jian Li, "SyncNN: Evaluating and Accelerating Spiking Neural Networks on FPGAs," *ACM Transactions on Reconfigurable Technology and Systems*, vol. 15, no. 4, pp. 48:1–48:27, Dec. 2022. [Online]. Available: https://doi.org/10.1145/3514253
- [46] Manon Dampfhoffer, Thomas Mesquida, Alexandre Valentian, and Lorena Anghel, "Are SNNs really more energy-efficient than ANNs? an in-depth hardware-aware study," *IEEE Transactions on Emerging Topics in Computational Intelligence*, pp. 1–11.