LoAS: Fully Temporal-Parallel Dataflow for Dual-Sparse Spiking Neural Networks

Ruokai Yin, Yale University, New Haven, USA, ruokai.yin@yale.edu
Youngeun Kim, Yale University, New Haven, USA, youngeun.kim@yale.edu
Di Wu, University of Central Florida, Orlando, USA, youngeun.kim@yale.edu
Priyadarshini Panda, Yale University, New Haven, USA, priya.panda@yale.edu

Abstract

Spiking Neural Networks (SNNs) have gained significant research attention in the last decade due to their potential to drive resource-constrained edge devices. Though existing SNN accelerators offer high efficiency in processing sparse spikes with dense weights, opportunities are less explored in SNNs with sparse weights, i.e., dual-sparsity. In this work, we study the acceleration of dual-sparse SNNs, focusing on their core operation, sparse-matrix-sparse-matrix multiplication (spMspM). We observe that naively running a dual-sparse SNN on existing spMspM accelerators designed for dual-sparse Artificial Neural Networks (ANNs) exhibits sub-optimal efficiency. The main challenge is that processing timesteps, a natural property of SNNs, introduces an extra loop to ANN spMspM, leading to longer latency and more memory traffic. To address the problem, we propose a fully temporal-parallel (FTP) dataflow, which minimizes both data movement across timesteps and the end-to-end latency of dual-sparse SNNs. To maximize the efficiency of FTP dataflow, we propose an FTP-friendly spike compression mechanism that efficiently compresses single-bit spikes and ensures contiguous memory access. We further propose an FTP-friendly inner-join circuit that can lower the cost of the expensive prefix-sum circuits with almost no throughput penalty. All the above techniques for FTP dataflow are encapsulated in LoAS, a Low-latency inference Accelerator for dual-sparse SNNs. With FTP dataflow, compression, and inner-join, running dual-sparse SNN workloads on LoAS demonstrates significant speedup (up to $8.51\times$ ) and energy reduction (up to $3.68\times$ ) compared to running it on prior dual-sparse accelerators.

I Introduction

Spiking Neural Networks (SNNs) have attracted considerable interest as potential energy-efficient substitutes for Artificial Neural Networks (ANNs) [5, 43, 11]. Inspired by the biological neuron, SNNs leverage highly sparse unary-coded ({0,1}) spikes to compute and communicate information [54]. Thus, running SNNs on hardware significantly reduces computation and data movement, making it suitable for edge computing. Therefore, SNNs have been widely used in computer vision tasks, such as image classification [46, 56], optical flow estimation [28], semantic segmentation [21], and object detection [20].

Opportunity. As the need for edge devices with limited memory capacity increases, recent research on SNNs highlights the significance of dual-sparsity (both spikes and weights are sparse), which can be achieved by neural pruning techniques [23, 5]. Pruning the weight connections of SNNs has been explored during both training [4, 49] and inference [38]. Certain works have managed to achieve approximately 98% weight sparsity and 90% spike sparsity [23], leveraging the lottery ticket hypothesis [13]. These works have outlined the potential of dual-sparse SNNs in reaching unprecedented energy efficiency and memory footprint with little to no compromise in accuracy.

Figure 1: An illustrative example of FTP dataflow and LoAS. FTP dataflow is shown along with the prior dataflow design for SNNs. Temporal sequential tick-batch is from SpinalFlow [36], and partially temporal parallel is from PTB [29]. Each arrow loop indicates the processing of one timestep. The vertical line indicates that the processing is in parallel.

Challenge. Although dual-sparse SNNs have made strides with algorithmic advancements, the hardware is not yet catching up to make full use of such dual-sparsity. In general, existing SNN accelerators can be categorized into two main groups. First, multi-core neuromorphic systems employ a plethora of cores, even chips, to exploit the inherent parallelism in spiking neuron dynamics [7, 1, 14, 48]. (Footnote 1: We are not comparing with those systems due to our focus on single-core dataflow SNN accelerator designs.) Though capable of capturing the massive parallelism and sparse activities across neurons, multi-core neuromorphic systems require all neurons (including weights) to be mapped on-chip. This undoubtedly wastes a huge amount of hardware resources on the neurons that are not involved in any computations due to the dual-sparsity [40]. Second, dataflow-based SNN accelerators draw inspiration from dataflow-based ANN accelerators and take advantage of the rich data reuse among the array of processing elements [29, 36, 33]. Nonetheless, these designs have mainly focused on processing dense SNN workloads. Currently, there is a lack of dataflow architectures that uniquely target dual-sparsity in SNNs. Table I summarizes existing dataflow SNN accelerators.

Insight. Though spikes and weights have varying bitwidth, in dual-sparse SNNs, their interactions follow the pattern in sparse-matrix-sparse-matrix multiplication (spMspM), which has been extensively studied in ANNs [15, 42, 19, 18, 39, 9, 64, 41, 51, 62]. However, naively running dual-sparse SNNs on existing spMspM accelerators is inefficient. The reason is multifaceted. First, the timesteps in SNNs complicate the dataflow design for existing spMspM accelerators. spMspM operations in ANNs are triple-nested for-loops [41, 55]. Different spMspM dataflows are obtained by permuting the order of loops. However, in SNNs, the timesteps introduce an extra level of for-loop, leading to extra latency and memory traffic. Worse, the extra loop constrains dataflow dependency and doubles the dataflow design space, delaying the time-to-solution. Second, the asymmetric bitwidth of spikes and weights in SNNs makes it inefficient to use conventional compression formats in ANN spMspM accelerators. Existing ANN spMspM accelerators store sparse matrices with popular compressed formats like compressed sparse row (CSR). These formats usually use multiple bits to record the coordinates of the non-zero values, and the hardware is designed accordingly. Consequently, using multiple bits to compress single-bit spikes (valued at either 1 or 0) is extremely inefficient for dual-sparse SNNs.

TABLE I: Comparison of LoAS with prior SNN accelerators. S and T denote the spatial and temporal dimensions. Spatial parallelism means PE-level parallelism.

| Accelerator | Spike sparsity | Weight sparsity | Parallel support | Neuron support |
|---|---|---|---|---|
| SpinalFlow [36] | ✔ | ✘ | S | LIF |
| PTB [29] | ✔ | ✘ | S+partial-T | LIF |
| Stellar [33] | ✔ | ✘ | S+fully-T | FS |
| LoAS (ours) | ✔ | ✔ | S+fully-T | LIF |

Proposal. To solve these problems and unleash the potential of dual-sparse SNNs in the presence of spMspM, we propose fully temporal-parallel (FTP) dataflow, illustrated in Figure 1. FTP dataflow parallelizes all timesteps to avoid complicated dataflow dependency for minimized latency and memory traffic. To maximize the efficiency of FTP dataflow on memory and computation, we design FTP-friendly spike compression and inner-join mechanisms. The proposed compression packs spikes along timesteps and can access the relevant memory space in a contiguous manner. The proposed inner-join nearly halves the cost of cumbersome prefix-sum circuits with almost no throughput penalty compared to prior inner-join designs. To validate FTP dataflow, we design LoAS, a Low-latency Inference Accelerator for Dual-Sparse Spiking Neural Networks. Our contributions are listed below:

1. We observe that SNNs with rich dual-sparsity from both input spikes and weight connections are sub-optimal on existing hardware. SNN hardware usually does not support sparse weights, while ANN spMspM hardware fails to efficiently process timesteps in SNNs with low latency and memory traffic.
2. To improve the efficiency of processing timesteps, we propose a fully temporal-parallel (FTP) dataflow. FTP avoids extra memory traffic across timesteps and minimizes the latency penalty in processing timesteps sequentially.
3. To make the most of FTP, we propose FTP-friendly spike compression for efficient yet contiguous memory access and an FTP-friendly inner-join mechanism for low-cost computation with almost no latency penalty.
4. We build LoAS, a novel architecture that exemplifies the FTP dataflow. With both FTP-friendly compression and inner-join, LoAS is able to achieve high speedup and energy efficiency against other sequential-running spMspM baselines.

The remainder of the text is organized as follows. Section II reviews the background and justifies the motivation. Sections III and IV articulate our proposed FTP dataflow and LoAS architecture. Next, Sections V and VI evaluate our design. Finally, Sections VII and VIII discuss and conclude this work.

II Background and Motivation

II-A Preliminary of SNNs

II-A1 Leaky-Integrate-and-Fire Neuron

The Leaky-Integrate-and-Fire (LIF) neuron is a classical neuron model [8] and widely adopted by prior SNN works [65, 24, 23, 63], thanks to its bio-plausibility and high accuracy. In this work, we focus on accelerating the workloads of dual-sparse SNNs that use LIF neurons.

During inference, each layer has an input spike tensor $A\in\mathbb{U}^{M\times K\times T}$ where $\mathbb{U}\in\{0,1\}$ and a weight matrix defined as $B\in\mathbb{Z}^{K\times N}$ . Here $T$ is the number of total timesteps; $M$ , $N$ , and $K$ are the spatial dimensions of the input and weight matrix. The behavior of an SNN layer can be described below:

Step 1: Sparse Matrix Multiplication. Sparse matrix multiplication across all timesteps is performed to obtain the full output matrix $O\in\mathbb{Z}^{M\times N\times T}$ , which will be sent to LIF neurons.

$$O_{m,n}[t_{i}]=\sum^{K}_{k=0}A_{m,k}[t_{i}]B_{k,n},\tag{1}$$

where the $t_{i}$ is the current timestep. With dual-sparsity, sparse matrix multiplication becomes spMspM.

Step 2: LIF firing. LIF neurons take the snapshot of $O$ at timestep $t_{i}$ and generate a snapshot of the output spike tensor $C\in\mathbb{U}^{M\times N\times T}$ for current timestep $t_{i}$ :

$$C_{m,n}[t_{i}]=\begin{cases}1, & X_{m,n}[t_{i}]>v_{th}\\ 0, & \text{otherwise},\end{cases}\tag{2}$$

where

$$X_{m,n}[t_{i}]=O_{m,n}[t_{i}]+U_{m,n}[t_{i-1}].$$

Here, $U[t_{i-1}]$ is the membrane potential that carries over the temporal information from previous timestep $t_{i-1}$ , and $v_{th}$ is the firing threshold, a pre-defined scalar value.

Step 3: Membrane Potential Update. After the output spikes are generated, we update the membrane potential that will carry residual information to the next timestep according to the equation below. (Footnote 2: We focus on the hard reset (membrane potential is reset to zero if there is an output spike of one) in this work. Though there exist other reset schemes, sticking with one of them will not lose generality in the hardware design.)

$$U_{m,n}[t_{i}]=\tau X_{m,n}[t_{i}](1-C_{m,n}[t_{i}]),\tag{3}$$

where $\tau\in(0,1)$ is the leaky factor. From the above equations, we observe that to generate the output spike matrix $C$ for timestep $t_{i}$ , we need to know the information from the previous timestep $U[t_{i-1}]$ . This brings temporal dependency between output spike matrices across timesteps. The behavior of a LIF neuron can be found in Figure 2.
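The three steps above can be sketched as a small functional reference model. This is a minimal illustration in plain Python, not the accelerator's implementation: the function name `lif_layer`, the nested-list layout `A[m][k][t]`, and the assumption that the membrane potential starts at zero are all choices made for this sketch.

```python
def lif_layer(A, B, v_th, tau):
    """Reference model of one SNN layer with LIF neurons and hard reset.

    A[m][k][t] holds input spikes in {0, 1}; B[k][n] holds integer weights.
    Returns the output spike tensor C[m][n][t].
    """
    M, K, T = len(A), len(A[0]), len(A[0][0])
    N = len(B[0])
    C = [[[0] * T for _ in range(N)] for _ in range(M)]
    for m in range(M):
        for n in range(N):
            u = 0.0  # membrane potential before the first timestep (assumed zero)
            for t in range(T):
                # Step 1 (Eq. 1): accumulate the weights whose input spike is 1.
                o = sum(B[k][n] for k in range(K) if A[m][k][t])
                # Step 2 (Eq. 2): fire when X = O + U exceeds the threshold.
                x = o + u
                c = 1 if x > v_th else 0
                C[m][n][t] = c
                # Step 3 (Eq. 3): leak, and hard-reset to zero after a spike.
                u = tau * x * (1 - c)
    return C


# One output neuron, two pre-synaptic neurons, three timesteps.
A = [[[1, 0, 1], [1, 1, 0]]]
B = [[2], [3]]
C = lif_layer(A, B, v_th=3, tau=0.5)  # [[[1, 0, 1]]]
```

In this example the spike at the last timestep only occurs because of the potential carried over from the previous timestep (2 + 1.5 > 3), which is the temporal dependency discussed above.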

II-A2 Spike Encoding and SNN Training

One key step in leveraging SNNs in conventional machine learning tasks is encoding the input source data (e.g., image pixels or text embeddings) into spike trains across multiple timesteps. The input spike trains are then sequentially sent to the SNN for processing. Recent SNN works adopt direct encoding (a special case of rate encoding) to achieve high accuracy on conventional computer vision tasks in very few timesteps ( $\leq 4$ ) [57, 65, 23, yin2023workload, 25]. In direct encoding, the source data, instead of being directly converted into spike trains, first goes through one ANN layer. The output from the ANN layer is then converted into spike trains. We will focus on accelerating direct-coded dual-sparse SNNs in this work. The SNNs are trained using backpropagation-through-time (BPTT) [53] with surrogate gradient [37] to achieve very close performance to ANNs on many complex tasks [57, 65].

II-B Distinctive Features and Challenge of SNNs

Several distinctive features make SNNs favorable for low-power edge deployment, but they also come with challenges.

Feature 1: Unary Activation. One of the most distinctive features of SNNs is their unary spike activation. More specifically, the SNNs leverage single-bit non-weighted activation to propagate information through layers. The primary benefit of the unary activation is the simplified low-power arithmetic units that they require. As shown in Figure 2, compared to the multiply-accumulate (MAC) of ANNs, SNN only requires simple bitwise-AND and accumulate (AC) operations during inference time. (Footnote 3: There exist other implementations using multiplexers instead [29, 36]. We focus on using bitwise-AND gates in this work.) Without the expensive multipliers [16], the computations for SNNs require extremely low power and area.

Feature 2: Sparse Spike Activity. The second feature of SNNs is their highly sparse spike-firing activity. In ANNs, upon completion, MAC results go through the ReLU unit, which filters out non-positive outputs. Different from ANNs, AC results in SNNs go through the Leaky-Integrate-and-Fire (LIF) unit, which only fires (generates an output of 1) when the input is greater than a pre-set threshold. As a result, the output sparsity in SNNs is usually much higher ( $\sim 90\%$ ) [63, 60, 61, 65] than that of ANNs ( $\sim 50\%$ ) [41, 45]. Sparser outputs lead to more computation and memory savings in the context of spMspM acceleration.

Challenge: Repeated Timesteps. Despite the aforementioned hardware-friendly features, one main challenge of deploying SNNs on hardware is their intrinsic repeated timesteps. A timestep is the minimum unit of time in SNNs, and time is thus discrete. (Footnote 4: Timestep is also called tick [36] or time-point [29] in other works. We follow the naming convention adopted by the latest SNN algorithm works.) In one timestep, each neuron needs to complete the AC operations for all inputs, fire a spike if necessary, and update its membrane potential (Section II-A1). The SNN needs to run across multiple timesteps to capture the temporal dynamics from the input data, as shown in Figure 2. Running multiple timesteps increases latency and energy consumption, diluting the advantage of low-power circuits unless we have a specialized architecture design [36].

II-C spMspM Dataflows in SNNs

There are various ways to map spMspM onto hardware, each with unique efficiency [31, 34]. Three different spMspM dataflows have been proposed in existing dual-sparse ANN accelerators: Inner-product (IP) [15, 42, 19, 18], Outer-product (OP) [39, 9, 64, 41], and Gustavson’s (Gust) [51, 62]. In Figure 3, we illustrate these three dataflows in SNNs for two input matrices $A$ and $B$ , and an output matrix $C$ . We also formulate their abstract loop nests on the right-hand side. As we discussed in Section II-B, it is impossible not to consider the multiple timesteps for spMspM operations in SNNs.

Inside the black box in Figure 3, the dataflow is for one timestep, thus identical to ANN dataflow. Outside the black box, multiple input matrices $A$ (blurred) represent the input spike matrices across different timesteps, which need to be processed. Meanwhile, multiple output spike matrices $C$ that are temporally dependent on one another are also generated. Specifically, to accommodate the timesteps in SNNs, we need to consider one more loop dimension ( $t$ dimension) in the original triple-nested for-loop. The $t$ dimension (annotated in the blue box) brings temporal dependency to each output pixel in SNNs. For example, to process the SNN using IP dataflow as shown in Figure 3, we first calculate the output cell at (0,0) position for timestep $0$ ( $C$ [0,0,0]), then instead of moving to the position (0,1), we move on to process the output cell at (0,0) for timestep $1$ ( $C$ [0,0,1]). Since the output cell $C$ [0,0,1] is temporally dependent on the result of the output cell $C$ [0,0,0], we cannot process $C$ [0,0,1] before $C$ [0,0,0].
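The processing order in the IP example can be made concrete with a short sketch. The generator below is a hypothetical helper (the name `ip_output_order` is not from the paper) that only enumerates the order in which output cells are completed when the $t$ loop sits directly inside the $(m, n)$ loops; the reduction over $k$ for each cell is omitted.

```python
def ip_output_order(M, N, T):
    """Yield output cells (m, n, t) in the order they are completed when the
    timestep loop is nested inside the (m, n) loops of an IP dataflow."""
    for m in range(M):
        for n in range(N):
            # Sequential over t: C[m, n, t] needs the membrane potential
            # left behind by C[m, n, t - 1].
            for t in range(T):
                yield (m, n, t)


order = list(ip_output_order(1, 2, 2))
# [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
```

Cell (0, 0, 1) is visited immediately after (0, 0, 0) and before (0, 1, 0), matching the example in the text.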

II-D ANN spMspM Hardware for dual-sparse SNNs

We review existing ANN spMspM accelerators to understand why naively running dual-sparse SNNs on these accelerators is sub-optimal.

Inner-join Design: For the IP dataflow, prior accelerators usually adopt the inner-join-based design [15, 9]. In such designs, non-zero values in rows of matrix $A$ and columns of matrix $B$ are compressed using bitmask representation (a bit string that has 1's for positions with non-zero values and 0's otherwise). An inner-join unit scans two bitmasks on the fly to determine if there's a matched position (both multiplicands are non-zero) and then sends the matched pairs to the compute units. Running dual-sparse SNNs on an inner-join-based design does not require extra bitmasks for the input spike matrix $A$ (the unary spike train itself can be viewed as a bitmask). However, as shown in Figure 4, the timesteps will impose multiple extra rounds of running the expensive inner-join units (e.g., occupying roughly 46% of the system-level power [15]), thus incurring high energy cost. Moreover, since the spike trains are used as bitmasks, all the spikes, whether 1 or 0, must be fetched from off-chip DRAM. This brings no memory traffic saving on the sparse spike matrix $A$ .

Merger-based Design: Unlike IP dataflow designs that exhibit full output matrix ( $C$ ) reuse, OP and Gust dataflow designs focus on the reuse of input matrices $A$ and $B$ . In OP, each column of $A$ and each row of $B$ will only be traversed once, leading to efficient input data reuse. However, one partial sum is generated at a time and merged later. While these two dataflows have better data reuse on the input matrix, the partial sum matrices (rows) potentially bring more off-chip data traffic. To amortize the large memory traffic of partial sums, some designs implement large and costly mergers (e.g., 38 $\times$ more area than multipliers [64]) to merge as many partial sum matrices (rows) as possible before sending them back to the off-chip DRAM. Due to the extra $t$ dimension, running dual-sparse SNNs on a merger-based design either requires a more complex merger that is capable of digesting the extra partial sum traffic or incurs more off-chip memory traffic. As shown in Figure 5, with four timesteps, on average, $4\times$ more partial sum traffic will be induced compared to a single timestep.

II-E Dataflow Architecture for SNNs

SpinalFlow: Temporal Sequential Design. SpinalFlow [36] is the first SNN-tailored accelerator for extracting the efficiency from the single-bit activation and the extremely sparse spike activity. The authors identified the challenge of sequentially processing the entire SNN network through timesteps. To overcome the challenge, SpinalFlow processes all timesteps for one layer before proceeding to the next layer, as shown in Figure 1. SpinalFlow dispatches LIF neurons across different processing elements (PEs) and parallelizes the computation. Within each layer, the timesteps are processed sequentially. SpinalFlow is optimized exclusively for the temporal-coded SNNs that potentially lag in terms of accuracy compared to rate-coded SNNs [29]. In this work, we focus on accelerating spMspM for general rate-coded SNNs that yield accuracy competitive with ANNs in various tasks.

PTB: Partially Temporal Parallel. While SpinalFlow's design is tailored to the temporal-coded SNNs, PTB [29] proposes a general architecture design for the rate-coded SNN. By leveraging the high data-reuse pattern across different PEs in the systolic array architecture [27], PTB breaks the processing of all timesteps into multiple time-windows (each consisting of several contiguous timesteps) and runs these time-windows in parallel, as shown in Figure 1. PTB maps multiple time-windows in parallel across different columns of the systolic array. The computation of different LIF neurons is also parallelized across the rows of the systolic array. We illustrate this hardware mapping strategy in detail in Figure 6. Though PTB tries to parallelize the processing of timesteps, the parallelization is at the granularity of the time-window. Inside each time-window (column of PEs), the timesteps are still processed sequentially. Consequently, we categorize PTB as a partially temporal parallel design. One aspect that distinguishes LoAS from PTB is that LoAS places the temporal dimension in the innermost loop, enabling all optimizations.

Prior SNN accelerators with LIF neurons process timesteps in a sequential or partially parallel manner. As we discussed in Sections II-C and II-D, this makes it very challenging for those existing SNN designs to perform well on spMspM SNN acceleration. Thus, we need an spMspM-friendly strategy to process timesteps.

Stellar: Fully Temporal Parallel but with non-LIF neurons. Stellar [33] is another systolic array SNN accelerator which attempts to process timesteps in a fully parallel manner. Nonetheless, Stellar focuses on optimizing for the Few Spikes (FS) neuron [52], as shown in Table I. FS neurons behave differently from LIF neurons by detaching the spike accumulating and firing stages. Therefore, FS neurons naturally do not have temporal dependency among the input data at the spike accumulation stage. This makes fully parallel temporal processing straightforward in Stellar. On the contrary, as discussed in Section II-A, temporal dependency naturally exists in the input data for the LIF neuron, which makes its design space different from the one in Stellar for fully temporal parallel processing. Unlike the widely adopted LIF neurons, supporting FS neurons also requires non-trivial algorithm-hardware codesign, which is out of the scope of this work.

Algorithm 1 Fully Temporal-Parallel dataflow (FTP)

Input:
Input spike matrix $A\in\mathbb{U}^{M\times K\times T}$ ( $\mathbb{U}\in\{0,1\}$ )
Weight matrix $B\in\mathbb{Z}^{K\times N}$
Output:
Output spike matrix $C\in\mathbb{U}^{M\times N\times T}$

1: for $m\in M$
2:   for $n\in N$
3:     for $k\in K$
4:       parallel-for $t\in T$   $\triangleright$ Spatially unrolled
5:         $O[m,n,t]\text{ += }A[m,k,t]\times B[k,n]$
6:     end for
7:     parallel-for $t\in T$   $\triangleright$ Spatially unrolled
         $C[m,n,t]\text{ = }LIF(O[m,n,t])$
8:   end for
9: end for

III Fully Temporal Parallel Dataflow

We propose a fully temporal-parallel dataflow (FTP) that targets reducing the negative effects of repeatedly processing the timesteps on spMspM accelerators (Section II-D). The proposed FTP is formulated in Algorithm 1.

An SNN-friendly spMspM dataflow should satisfy three goals: (1) avoid as much data refetch as possible across the timesteps; (2) generate as few partial sums as possible on the temporal dimension (timesteps); (3) reduce the latency as much as possible on the temporal dimension to reduce the extra cost of sparsity handling units.

Our first observation is that for all three spMspM dataflows (Section II-C), unless the temporal dimension ( $t$ -dim) is placed at the innermost loop, it will bring at least $T$ times more data refetch to the dimensions below, compared to the original dataflow. For example, in OP, if $t$ -dim is placed between $m$ and $n$ , $T$ times more access to $B$ 's rows is required. If $t$ -dim is placed between $k$ and $m$ , $T$ times more access to $A$ 's columns and $B$ 's rows is required. Depending on the on-chip buffer capacity, repeated memory access might lead to more expensive access to the off-chip memory, which opposes goal (1).

Our second observation is that both OP and Gust dataflow are not suitable for dual-sparse SNNs since they oppose goal (2). In OP dataflow, we observe that no matter where we insert the $t$ dimension into the original triple-nested loop, we always produce $T$ times more partial sum matrices compared to the original OP dataflow. The partial sums need to be stored in an on-chip cache till all partial sums along both spatial ( $k$ ) and temporal dimensions ( $t$ -dim) are accumulated. This will add extra memory overhead in OP. The same problem also exists for Gust dataflow. The $t$ -dim will either generate $T$ times more partial sum rows or have $T$ times more access to both $k$ and $n$ dimensions. The last observation is that regardless of the position of $t$ -dim, as long as we process it sequentially, it always incurs $T$ times more processing latency, which opposes goal (3).

Our solution is straightforward but effective. We first choose to position the $t$ -dim at the innermost loop of the IP dataflow, as given in Algorithm 1. This design choice has several advantages. Firstly, putting the $t$ -dim at the innermost loop ensures that no extra data movement will be incurred (goal (1)). Secondly, since IP dataflow has efficient output reuse, no extra partial sums will be generated on the $t$ -dim (goal (2)). Lastly, we fully parallelize the $t$ -dim and eliminate the latency brought by sequentially processing timesteps (goal (3)). This is equivalent to transforming the for-loop of $t$ into a parallel-for loop [55]. This parallel-for loop parallelizes the operation across different spatial instances. The hardware overhead is minimal because only cheap accumulators are duplicated and direct-coded SNNs use few timesteps (Section II-A). We later show in the ablation studies that FTP scales well with the increasing timesteps.
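The loop structure of Algorithm 1 can be mirrored in a short software model. This is a minimal sketch in plain Python under stated assumptions: `ftp_layer` and `lif_fire` are hypothetical names, tensors are nested lists, the membrane potential starts at zero, and the inner `for t` loop stands in for the spatially unrolled parallel-for, which hardware would execute in a single step across $T$ accumulators.

```python
def lif_fire(o_per_t, v_th, tau):
    """Apply LIF firing with hard reset to the full sums of one output neuron."""
    u, spikes = 0.0, []
    for o in o_per_t:
        x = o + u
        c = 1 if x > v_th else 0
        spikes.append(c)
        u = tau * x * (1 - c)
    return spikes


def ftp_layer(A, B, v_th, tau):
    """Software model of the FTP loop nest: IP dataflow with t innermost."""
    M, K, T = len(A), len(A[0]), len(A[0][0])
    N = len(B[0])
    C = [[None] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            O = [0] * T  # one accumulator per timestep, kept on-chip
            for k in range(K):
                w = B[k][n]  # each weight is read once for all timesteps
                if w == 0:
                    continue
                # parallel-for over t (sequential here only because this is software)
                for t in range(T):
                    if A[m][k][t]:
                        O[t] += w
            # Full sums for all timesteps are ready, so all output spikes
            # of this neuron are produced together.
            C[m][n] = lif_fire(O, v_th, tau)
    return C


A = [[[1, 0, 1], [1, 1, 0]]]
B = [[2], [3]]
C = ftp_layer(A, B, v_th=3, tau=0.5)  # [[[1, 0, 1]]]
```

Because the reduction over $k$ finishes for every timestep before firing, no partial sums leave the accumulators and each weight is touched once per output neuron, which is the intent of goals (1) and (2).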

IV LoAS

An overview of LoAS is shown in Figure 7. LoAS consists of multiple temporal parallel processing elements (TPPEs) and parallel Leaky-Integrate-Fire units (P-LIFs) that are tailored to run the FTP dataflow; a scheduler that distributes workloads across TPPEs; and a compressor that compresses the output spikes from P-LIFs and writes them back to the on-chip memory. An on-chip SRAM is equipped to capture data reuse.

IV-A Spikes Compression

We first discuss how sparse input spikes (matrix $A$ ) across timesteps are compressed in LoAS. Efficiently compressing matrix $A$ in SNNs necessitates solving two challenges:

How to maximize the compression ratio of 1-bit spikes? Assume that the input spike matrix $A$ has a size of 128 $\times$ 128 for each timestep. Then for either CSR or CSC, we need to use two 7-bit coordinates to compress each 1-bit non-zero spike. (Footnote 5: For 128 columns, we need $\log_{2}(128)=7$ bits for coordinates. We neglect the offsets in the discussion, which will further increase the number of bits used for coordinates.) Furthermore, SNNs naturally run for multiple timesteps, which means that for the same coordinate, different spike values may occur at different timesteps (e.g., $0$ for T=1&3, and $1$ for T=2&4). To faithfully capture all the non-zero spikes, we need separate coordinate values for each timestep.

How to maintain contiguous memory access of non-zero spikes across timesteps? The FTP dataflow we proposed in Section III requires spatial unrolling of the input spike matrix $A$ across all timesteps beneath the $k$ dimension. Consequently, a non-contiguous memory layout of $A$ along the $t$ dimension will cause fragmented memory access at all levels of the memory hierarchy, leading to higher data movement costs.

To better illustrate these two points, we provide an example in Figure 8. Suppose that, in the input spikes sent to the system, the pre-synaptic neuron $a_{0,0}$ (first element of row-0 in matrix $A$ ) fires a spike at $t_{0}$ and $t_{2}$ . As shown in Figure 8, to represent this pre-synaptic neuron behavior, a single-bit 1 needs to be stored in memory at row-0, column-0 of matrix $A$ for both timesteps 0 and 2, shown in the box of 'unpacked real data.' If we then use a coordinate value (e.g., 4-bit for CSR) to record the position of each non-zero spike in row-0 of matrix $A$ at each timestep, we need $2\times 4=8$ bits to compress 2 bits (2 spikes). The compression efficiency in this case is only $25\%$ . Furthermore, memory access to spikes across different timesteps is non-contiguous (sequentially accessing different rows of $A$ ).

We propose the following spike compression format for LoAS to solve these two challenges. In our method, as shown in Figure 8, we pack all the spikes (both 0 and 1) across all timesteps into one contiguous data block for each pre-synaptic neuron. In the example of Figure 8, we store a 4-bit value 1010 at the first position of row-0 of matrix $A$ for $a_{0,0}$ and 0111 at the fourth position for $a_{0,3}$ . Since neurons $a_{0,1}$ and $a_{0,2}$ do not spike at any timestep, their packed value would be 0000 (shown in the box of 'packed real data'). We define these neurons as silent neurons. (Footnote 6: We follow the same terminology used in [29].) With this strategy, only the non-silent neurons will be treated as non-zero values and stored in the memory for matrix $A$ , as shown in Figure 8. In our example, we end up using 4 coordinate bits to compress 5 spike bits (the five 1s in 1010 and 0111). The compression efficiency in this case is $125\%$ .

To accommodate our FTP dataflow, we compress the input spike matrix $A$ in a row-wise manner and use the bitmask format [42, 15, 9] to represent the coordinates of the non-zero values. The bitmask format uses a 1-bit coordinate value for each position in the row. In our example, the bitmask is 1001 since the first and the fourth elements in the row are non-zero. The second and third elements are silent neurons, so we do not store them in the memory (represented by a 0 in the bitmask). Following the bitmask, a pointer is stored to provide the starting location of the non-zero values of the row. We call this compressed row a fiber [62, 34].
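The packing and bitmask steps can be illustrated with the row from the running example. This is a hedged sketch only: `compress_row` is a hypothetical helper, packed words are kept as bit strings with the first timestep leftmost for readability, and the per-row pointer mentioned above is omitted.

```python
def compress_row(row_spikes):
    """Compress one row of A into a fiber-like (bitmask, packed values) pair.

    row_spikes[k] is the list of T spike bits of pre-synaptic neuron k.
    bitmask[k] is 1 for non-silent neurons; packed holds only their T-bit words.
    """
    bitmask, packed = [], []
    for bits in row_spikes:
        if any(bits):  # non-silent: fires in at least one timestep
            bitmask.append(1)
            packed.append("".join(str(b) for b in bits))
        else:          # silent neuron: nothing is stored
            bitmask.append(0)
    return bitmask, packed


# Row 0 of the example: a_{0,0} = 1010, a_{0,1} = a_{0,2} = 0000, a_{0,3} = 0111.
row0 = [[1, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 1, 1]]
bitmask, packed = compress_row(row0)   # [1, 0, 0, 1], ['1010', '0111']

# Spike bits covered per coordinate bit, as in the 125% figure of the text.
efficiency = sum(map(sum, row0)) / len(bitmask)   # 5 / 4 = 1.25
```

Since the words of all timesteps for a neuron sit next to each other in `packed`, a single contiguous read delivers everything the parallel-for over $t$ needs for that neuron.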

The key to our compression method is the ratio of silent neurons in the SNN. Fortunately, empirical studies have shown that SNNs have a significant fraction of silent neurons ( $60\%\sim 70\%$ , as shown in Table II). We further use a similar bitmask-based technique to compress weights in a column-wise manner. Each compressed weight column is also called a fiber.

IV-B Temporal Parallel Processing Elements

The fundamental building blocks of LoAS's compute engine are Temporal Parallel Processing Elements (TPPEs) and Parallel Leaky-Integrate-Fire units (P-LIFs), which we describe next. Figure 7 also details the design of TPPE. Each TPPE produces the full sum for one output neuron across all timesteps (Line 5 in Algorithm 1). Before the computation starts, the bitmask (bm-B) of a fiber from weight matrix $B$ (fiber-B) and its non-zero data are read from SRAM and broadcast into the small bitmask buffers (128 bits in our design) inside each TPPE. The bitmask (bm-A) of a fiber from input spike matrix $A$ (fiber-A) is also fetched and sent to the TPPEs. Each TPPE will hold the bitmask for a distinct fiber along the row of $A$ . After the data are loaded, an inner-join operation [15, 9, 18] is performed between the two bitmasks. Depending upon the inner-join result, the matched non-zero data of fiber-A will be fetched from the global cache and sent to the pseudo-accumulator (discussed in Section IV-C) to perform the accumulation (AC) operation. After the TPPE completes the full computation of one output neuron, it will send the result to the P-LIF unit to generate output spikes for all timesteps in one shot.

IV-C Inner-join Unit

The inner-join operation has been extensively studied by prior works [15, 9, 18] for spMspM acceleration in ANNs. The inner-join mechanism with a prefix-sum circuit has been efficiently implemented with the bitmask representation [15]. In [15], a logical-AND operation is first applied to two bitmasks to get the AND-result, which represents the locations where both data are non-zero. The AND-result is then sent to a priority encoder to convert the matched positions into integer values. The matched positions are sent to two separate prefix-sum circuits to get the number of 1s in front of the matched position for each bitmask. This gives the offset of each matched non-zero value in the memory.
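The bitmask inner-join described above can be modeled in a few lines. This is a functional sketch, not the circuit of [15]: `inner_join` is a hypothetical name, and the running counters play the role of the two prefix-sum circuits, computed sequentially here rather than in a single cycle.

```python
def inner_join(bm_a, bm_b):
    """Return (position, offset_a, offset_b) for every position where both
    bitmasks are 1. Each offset is the number of 1s before that position in
    the corresponding bitmask, i.e. the index into the fiber's stored values.
    """
    and_result = [a & b for a, b in zip(bm_a, bm_b)]
    matches = []
    count_a = count_b = 0  # running prefix sums over bm_a and bm_b
    for pos, hit in enumerate(and_result):
        if hit:
            matches.append((pos, count_a, count_b))
        count_a += bm_a[pos]
        count_b += bm_b[pos]
    return matches


# bm-A from the compression example (1001) against an arbitrary bm-B (0111).
matches = inner_join([1, 0, 0, 1], [0, 1, 1, 1])   # [(3, 1, 2)]
```

Only position 3 is non-zero in both fibers. The offsets say that the matched entry is the second stored value of fiber-A and the third stored value of fiber-B.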

During the above process, the use of two fast prefix-sum circuits is an expensive operation (taking more than 45% power and area in [15]).⁷ To reduce the overhead brought by the prefix-sum circuits, we propose an FTP-friendly inner-join unit that is detailed in Figure 9.

⁷ In [15], the design of the prefix-sum circuit is not described. We assume it to be a tree-like prefix-sum circuit with $O(\log n)$ complexity that can run in one clock cycle. $n$ is the size of the input and output of the prefix-sum circuit, which is set to 128 in both [15] and our work.

We first observe that in ANNs, the MAC operation requires both inputs to be explicitly known at computation time. Therefore, we need two fast prefix-sum circuits to match the processing speed between two inputs. However, this is not the case with SNNs. In SNNs, we only have two cases for the input ( $1$ or $0$ ), meaning we either accumulate or discard the weight. This provides the opportunity to have an imbalanced processing speed for two inputs at the prefix-sum stage.

In our design, instead of using two fast prefix-sum circuits as in ANNs, we have one fast and one laggy prefix-sum circuit, as shown in Figure 9. Recall that our compression method only fetches the non-silent neurons (that fire at least once across timesteps) from DRAM for $A$ . Thus, as soon as we find a matched position in AND-result, we are confident that the corresponding non-zero value in fiber-B will be accumulated at least once (at least one timestep). Therefore, we can begin accumulating the non-zero value in fiber-B without knowing the exact spike information from fiber-A. In this way, we can ensure the throughput of consuming fiber-B is always high regardless of the processing speed of fiber-A.

In our efficient inner-join unit, each time the fast prefix-sum circuit generates an offset, the corresponding non-zero value of fiber-B is sent directly to a pseudo-accumulator for accumulation. This mechanism opportunistically presumes that the matched non-zero value of fiber-A is all 1s (the pre-synaptic neuron fires at all timesteps) to fully leverage the throughput of the fast prefix-sum circuit. Since the non-zero value in fiber-A is not always all 1s, we need a mechanism to ensure that the accumulation results are correct. Instead of using the expensive fast prefix-sum circuit to access and check the matched non-zero value in fiber-A, we use a much simpler circuit to generate the offset of fiber-A. We refer to this simpler circuit as the laggy prefix-sum circuit, illustrated on the left of Figure 9. We use a group of adders to sequentially add up the prefix-sum results and store them inside a small buffer. These adders run in parallel, and hence the latency of generating all the offsets is equal to len(bm-A)/(# of adders).
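A minimal behavioral sketch of the laggy prefix-sum latency is given below. It assumes that each modeled cycle resolves one chunk of `num_adders` consecutive bitmask positions and carries the running count into the next chunk; the function name and this chunking are illustrative assumptions, not a description of the actual circuit.

```python
from typing import List, Tuple


def laggy_prefix_sum(bitmask: List[int], num_adders: int) -> Tuple[List[int], int]:
    """Exclusive prefix sum computed num_adders positions per cycle.

    Returns (offsets, cycles), where offsets[i] is the number of 1s
    before position i and cycles is the number of modeled cycles.
    """
    offsets: List[int] = []
    running = 0
    cycles = 0
    for start in range(0, len(bitmask), num_adders):
        # One modeled cycle: the adder group handles one chunk.
        for bit in bitmask[start:start + num_adders]:
            offsets.append(running)
            running += bit
        cycles += 1
    return offsets, cycles
```

Under this model, a 128-bit bitmask processed with 16 adders takes 128/16 = 8 cycles, which agrees with the latency formula len(bm-A)/(# of adders).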

We provide a simple walk-through example in Figure 10. We first run the fast prefix-sum circuit; in every cycle, we accumulate the matched non-zero value of fiber-B and buffer it together with the matched position in small FIFOs. When the laggy prefix-sum circuit finishes running, a ready signal is sent out. We then check the non-zero value in fiber-A according to the buffered position from FIFO-mp. If the matched value is all 1s, we simply discard the current value in FIFO-B. Otherwise, we send the buffered non-zero value of fiber-B from FIFO-B to the correction accumulators. As illustrated in Figure 10, at cycle 4, we check $a_{2}$ and find its value is 1111. Thus, we simply discard $b_{2}$. At cycle 5, we check $a_{4}$ and find its value is 1010. Thus, we send $b_{4}$ to the correction accumulators for $t_{1}$ and $t_{3}$. This example shows the motivation and benefits of combining fast and laggy prefix sums. With a fast prefix sum, we can consume B as early as possible by first accumulating it into the pseudo-accumulator. While waiting for the laggy prefix sum to correct the accumulation results, we can proceed to fetch the next fiber-B’s data into the buffer. This way, the latency of fetching fiber-B is overlapped with the laggy prefix sum and correction, improving the overall throughput. At the same time, replacing one fast prefix sum with a laggy one reduces the overall power and area of our TPPE.

IV-D Other Units

After the computation of the pseudo-accumulator completes, its accumulation results are duplicated and sent to each correction accumulator. The correction value inside each accumulator will be subtracted from the pseudo accumulation results for each timestep. Finally, we send the corrected results to the P-LIF units to generate the output spikes. As shown inside the purple box in Figure 7, we spatially unroll the LIF operations so that the output spikes for all timesteps will be generated at once.
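The pseudo-accumulation and correction scheme can be checked with a short functional sketch. It assumes that each matched spike word is given as a list of 0/1 values indexed by timestep; the names `ftp_accumulate` and `reference_accumulate` are hypothetical, and the model ignores FIFOs and cycle timing.

```python
from typing import List, Sequence


def ftp_accumulate(weights: Sequence[int],
                   spikes: Sequence[Sequence[int]],
                   timesteps: int) -> List[int]:
    """Per-timestep sums via pseudo-accumulation plus correction.

    weights[i] is the matched non-zero weight b_i; spikes[i][t] is the
    spike (0 or 1) of the matched non-silent neuron a_i at timestep t.
    """
    pseudo = 0                      # presumes every matched spike word is all 1s
    correction = [0] * timesteps    # one correction accumulator per timestep
    for w, word in zip(weights, spikes):
        pseudo += w
        if all(word):
            continue                # presumption held: nothing to correct
        for t in range(timesteps):
            if word[t] == 0:
                correction[t] += w  # weight was wrongly added for timestep t
    # Subtract each correction value from the duplicated pseudo result.
    return [pseudo - c for c in correction]


def reference_accumulate(weights, spikes, timesteps):
    """Direct per-timestep accumulation, used only as a cross-check."""
    return [sum(w for w, word in zip(weights, spikes) if word[t])
            for t in range(timesteps)]
```

With spike words 1111 and 1010, as in the walk-through of Figure 10, only the second weight reaches the correction accumulators, and only for the two timesteps where its spike is 0.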

LoAS uses a unified global buffer for holding compressed fiber-A and fiber-B with their bitmask representations. We adopt a FiberCache design [62]. A unified shared cache exhibits better utilization compared to separate ones. Each line in the global cache consists of two parts. The first part is the bitmask representation of a fiber, followed by a pointer. The second part holds the non-zero values of that fiber. If the line can hold all the non-zero values, the pointer is NULL. Otherwise, it points to the line where the rest of the data are held. Each PE is responsible for generating one output neuron. Therefore, we use a highly banked global cache to ensure multiple PEs can access their data concurrently. Inside each bank, we fetch as many chunks as possible for one fiber in matrix $A$ and hold them as long as possible to maximize the reuse of $A$. This can be achieved by adopting a replacement policy for the global cache as in [62, 31]. Only one compressed row fiber of matrix $B$ is fetched into the global cache and broadcast to all TPPEs. We follow the compression unit design of [15], where an inverted prefix-sum circuit is used to compress the output spikes and generate their bitmask representations. Similar to the observation in [15], this compression step does not need to be performed fast. Therefore, we use an inverted laggy prefix-sum circuit to perform the compression. The scheduler is responsible for distributing the data to each TPPE through a simple swizzle-switch-based crossbar [47].
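The two-part cache line with its overflow pointer can be pictured with the following data-structure sketch. The class `CacheLine`, the dictionary-based cache, and the choice to leave the bitmask empty on continuation lines are all assumptions made for illustration; the text does not specify how continuation lines are laid out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CacheLine:
    bitmask: List[int]                 # first part: bitmask of the fiber
    next_line: Optional[int]           # pointer; None plays the role of NULL
    values: List[int] = field(default_factory=list)  # second part: non-zero values


def read_fiber_values(cache: Dict[int, CacheLine], line_id: int) -> List[int]:
    """Collect all non-zero values of a fiber by following line pointers."""
    values: List[int] = []
    current: Optional[int] = line_id
    while current is not None:
        line = cache[current]
        values.extend(line.values)
        current = line.next_line
    return values
```

A fiber that fits in one line has `next_line=None`; a longer fiber chains to a second line that holds the remaining values.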

V Experimental Methodology

Software Configuration: For the dual-sparse SNNs, we train and compress AlexNet [26], VGG16 [50], and ResNet19 [17]. We use the open-source toolchains for lottery-ticket-hypothesis (LTH)-based SNN pruning [22, 13]. We set the default number of timesteps $T$ to 4 across all experiments. We use 15 rounds of LTH searching, and all SNNs are trained to convergence with accuracy similar to state-of-the-art dense baselines [22]. We further select representative layers from each network to provide single-layer insights. The workloads are summarized in Table II.

TABLE II: SNN workloads. NL = # of layers. T = Timesteps. AvSp{A, B} = Average sparsity of the matrices {A, B} (in %). AvSpA-origin is the original spike sparsity across timesteps, AvSpA-packed is the density of silent neurons, and AvSpA-packed+FT is the density after fine-tuned preprocessing. M/N/K denotes matrix shape.

| SNN | NL | T | AvSpA-origin | AvSpA-packed(+FT) | AvSpB |
|---|---|---|---|---|---|
| AlexNet(A) | 7 | 4 | 81.2 | 71.3(76.7) | 98.2 |
| VGG16(V) | 14 | 4 | 82.3 | 74.1(79.6) | 98.2 |
| ResNet19(R) | 19 | 4 | 68.6 | 59.6(66.1) | 96.8 |

| Layer | T,M,N,K | AvSpA-origin | AvSpA-packed(+FT) | AvSpB |
|---|---|---|---|---|
| A-L4 | 4,64,256,3456 | 75.8 | 63.2(69.7) | 98.9 |
| V-L8 | 4,16,512,2304 | 88.1 | 76.5(86.8) | 96.8 |
| R-L19 | 4,16,512,2304 | 57.9 | 51.4(55.7) | 99.1 |
| T-HFF | 4,784,3072,3072 | - | -(86.8) | 96.8 |

We further use a simple yet effective preprocessing technique: zeroing out all presynaptic neurons that have low firing activity to further increase the number of silent neurons. We take the trained SNN and mask the neurons with only one output spike throughout all timesteps. We find that with a small amount of fine-tuning ($<$ 5 epochs), the accuracy can be fully recovered, as shown in Figure 11. Please note that this preprocessing technique aims to maintain the accuracy of the original workload instead of improving it. During hardware execution, the compressor discards the output neurons that have 0 or only 1 output spike. From Table II, we see that preprocessing effectively creates up to $1.1\times$ more silent neurons.
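The masking step of this preprocessing can be sketched as below, assuming the spike activity of a layer is available as a list of per-neuron spike trains. The function names and the `min_spikes` parameter are hypothetical; the fine-tuning that follows the masking is not modeled.

```python
from typing import List


def mask_low_activity(spikes: List[List[int]], min_spikes: int = 2) -> List[List[int]]:
    """Zero out neurons that fire fewer than min_spikes times.

    spikes[n][t] is the spike of neuron n at timestep t. With the default
    min_spikes=2, neurons with exactly one spike become silent.
    """
    return [row if sum(row) >= min_spikes else [0] * len(row) for row in spikes]


def silent_ratio(spikes: List[List[int]]) -> float:
    """Fraction of neurons that never fire across all timesteps."""
    silent = sum(1 for row in spikes if sum(row) == 0)
    return silent / len(spikes)
```

Comparing `silent_ratio` before and after `mask_low_activity` gives the kind of increase in silent neurons that the packed and packed(+FT) columns of Table II report.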

Hardware Configuration: We evaluate LoAS with the configuration in Table III. In our experiments, we configure LoAS to support SNNs running with 4 timesteps. We use 16 TPPEs, each with 5 accumulators (one 12-bit pseudo-accumulator and four 10-bit correction accumulators) and 1 inner-join unit. Inside each inner-join unit, there is 1 fast prefix-sum circuit and 1 laggy prefix-sum circuit. The fast prefix-sum circuit can generate the offsets in a single cycle. The laggy prefix-sum circuit contains 16 adders and a 128-bit buffer. It generates the offset results in 8 cycles. The TPPE also has 2 depth-8 FIFOs (for correction purposes) and 2 128-bit buffers (for holding bitmasks). Finally, a 128-byte buffer inside the TPPE holds the non-zero weights from fiber-B. We allocate 256 KB (double-buffered) for the global cache. For our workloads, this memory size is enough to capture good on-chip data reuse and keep all TPPEs busy.

Baseline: As discussed previously, there are currently very few spMspM accelerators available for dual-sparse SNNs. As a result, we construct our baselines as follows. We first pick three popular ANN spMspM accelerators that use IP, OP, and Gust dataflow: SparTen [15], GoSPA [9], and Gamma [62]. We then envision that a dual-sparse SNN (with 4 timesteps and 8-bit weights) is naively running (sequentially processing its timesteps) on these accelerators.

TABLE III: Configuration of the LoAS System.

| Component | Configuration |
|---|---|
| TPPEs | 16 TPPEs, 8-bit weight |
| Inner-join unit | 16 Inner-join units |
| Global cache | 256 KB, 16 banks, 16-way associative |
| Crossbars | $16\times 16$ and $16\times 16$, swizzle-switch based |
| Main memory | 128 GB/s over 16 64-bit HBM channels |

To be conservative, we place the $t$ dimension at the innermost loop of the original IP, OP, and Gust dataflow.⁸ We then make essential simplifications to these accelerators. For example, we remove the multipliers in these designs. To make a fair comparison, we configure all designs to have 16 PEs and the same global SRAM size. We call these three baselines SparTen-SNN, GoSPA-SNN, and Gamma-SNN.

⁸ Adding the $t$ dimension anywhere else will bring more data traffic, thus worsening the performance.
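To make the baseline construction concrete, the following is a minimal functional sketch of an inner-product loop nest with the $t$ dimension innermost and processed sequentially. The tensor layouts `A[t][m][k]` and `B[k][n]` and the function name are assumptions for illustration; the sketch models only the loop order and the accumulate-only arithmetic, not the compression or hardware mapping of any baseline.

```python
from typing import List


def ip_snn_sequential(A: List[List[List[int]]], B: List[List[int]]) -> List[List[List[int]]]:
    """Inner-product SNN matmul with the t loop innermost.

    A[t][m][k] is a binary spike, B[k][n] is a weight, and the result
    O[t][m][n] is the accumulated input for each timestep.
    """
    T, M, K = len(A), len(A[0]), len(A[0][0])
    N = len(B[0])
    O = [[[0] * N for _ in range(M)] for _ in range(T)]
    for m in range(M):
        for n in range(N):
            for k in range(K):
                if B[k][n] == 0:
                    continue              # skip zero weights
                for t in range(T):        # timesteps handled one after another
                    if A[t][m][k]:
                        O[t][m][n] += B[k][n]  # accumulate only, no multiply
    return O
```

In this sketch every timestep revisits the same weight in sequence, which is the per-PE latency penalty that the FTP dataflow is designed to avoid by handling all timesteps in parallel.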

We implement the key components of LoAS and our hardware baselines in RTL and synthesize them using Synopsys Design Compiler at 800 MHz with 32 nm technology. A 128 GB/s High-Bandwidth Memory (HBM) module is connected to LoAS as the off-chip memory. We use CACTI 7.0 [35] to model the memory components. We build a simulator in Python to model the cycle-level behavior of LoAS and the baselines by tiling the loop and mapping it to hardware.

VI Experimental Results

VI-A Hardware Evaluation

Overall Performance: Figure 12 compares the performance of three dual-sparse SNN accelerator baselines (SparTen-SNN, GoSPA-SNN, and Gamma-SNN) and LoAS (with and without fine-tuned preprocessing) on three SNNs (speedup is reported w.r.t. the cycle count of SparTen-SNN).

The first observation is that LoAS significantly outperforms the other three accelerator baselines in all cases, obtaining average speedups of $6.79\times$ (vs. SparTen-SNN), $5.99\times$ (vs. GoSPA-SNN), and $3.25\times$ (vs. Gamma-SNN). This is because LoAS leverages the FTP dataflow. The FTP dataflow completely frees LoAS from the intra-PE latency penalty of sequentially running the timesteps. It also enables LoAS to incur less on-chip and off-chip data communication across timesteps.

The second observation is that LoAS’s performance gain is highly correlated with the sparsity of matrix A. This relationship is expected since our workloads are extremely sparse on matrix B; thus, the overall computation is matrix-A-bound. Consequently, the baselines suffer more from sequentially running timesteps when matrix A is less sparse, whereas LoAS does not incur this sequential-execution penalty. As a result, LoAS achieves speedups ranging from $4.08\times$ (vs. SparTen-SNN) on VGG16 (highest matrix A sparsity) to $8.51\times$ (vs. SparTen-SNN) on ResNet19 (lowest matrix A sparsity). Finally, we observe that with the help of preprocessing (removing the neurons that spike only once), LoAS further improves the performance by $20\%$ on average. This is because the preprocessing technique increases the density of silent neurons (Section IV-A), for which LoAS completely avoids data communication and computation. Figure 12 also compares the energy efficiency of LoAS and the three baselines on different SNN workloads. LoAS (with preprocessing) achieves ($3.68\times$, $3.09\times$, $2.40\times$), ($3.17\times$, $1.50\times$, $2.33\times$), and ($3.54\times$, $1.34\times$, $2.47\times$) higher energy efficiency over (SparTen-SNN, GoSPA-SNN, Gamma-SNN) on AlexNet, VGG16, and ResNet19, respectively.

Detailed Analysis: We next explain the performance gains of LoAS. Owing to the FTP dataflow, LoAS has much less on-chip and off-chip memory traffic than the baselines. As shown in Figure 13, compared to SparTen-SNN (IP), LoAS has $3.93\times$ ($3.70\times$), $3.57\times$ ($2.22\times$), and $4.07\times$ ($2.24\times$) less on-chip SRAM (off-chip DRAM) access on AlexNet, VGG16, and ResNet19, respectively. This behavior is expected since an IP dataflow design like SparTen is known for having poor input data reuse. This inefficient input data reuse pattern is exacerbated by the extra temporal dimension ($t$-dim) in SNN workloads. While the FTP dataflow is a variant of inner-product, it does not incur any extra executions on the $t$-dim since it parallelizes the $t$-dim at the innermost loop.

Not surprisingly, compared to GoSPA-SNN (OP), LoAS still achieves $2.87\times$ ($4.49\times$), $2.19\times$ ($2.78\times$), and $2.98\times$ ($3.03\times$) less on-chip SRAM (off-chip DRAM) access on AlexNet, VGG16, and ResNet19, respectively. This behavior is also expected, even though an OP dataflow design is known to have excellent input data reuse (on average, GoSPA-SNN has $1.45\times$ less SRAM traffic than SparTen-SNN). The inefficiency of GoSPA-SNN comes from the partial sum (psum) matrices. Because of the extra $t$-dim in SNNs, the size of the psum matrices grows with the number of timesteps. GoSPA’s design allocates a small on-chip memory for the psums. The psum matrices that cannot fit on-chip must be written to off-chip DRAM and read back later for reduction. This incurs significant off-chip memory traffic.

Finally, compared to Gamma-SNN (Gust), LoAS achieves $2.16\times$, $1.76\times$, and $1.91\times$ fewer DRAM accesses. This result is aligned with the Gust dataflow’s ability to reduce off-chip partial row accesses through on-chip SRAM and mergers. While Gamma reduces DRAM accesses, its SRAM accesses are exacerbated by the $t$-dim in SNNs. As a result, Gamma-SNN incurs on average $13.4\times$ more SRAM traffic than LoAS.

To better visualize the aforementioned analysis, we provide a memory traffic breakdown in Figure 14 for the three SNN layers in Table II. As shown in the figure, SparTen-SNN has the largest input off-chip traffic, and GoSPA-SNN has the largest psum off-chip traffic across all workloads. Among the three baselines, Gamma-SNN has the smallest off-chip traffic footprint due to the Gust dataflow’s on-chip reuse of partial rows. GoSPA-SNN has the largest off-chip traffic for the compressed format due to its CSR format for each spike. We notice that LoAS has slightly larger ($2.1\times$) off-chip traffic for the compressed format compared to SparTen-SNN. This is because we need extra bitmasks to mark the position of non-silent neurons, while in SparTen-SNN, we can directly leverage the input spike trains. Nevertheless, this overhead is negligible compared to LoAS’s savings on off-chip traffic for other quantities. Figure 14 also provides the normalized SRAM cache miss rate for the layer workload in ResNet19. SparTen-SNN has a $16\times$ higher miss rate (1.47%) compared to LoAS. GoSPA-SNN has the lowest miss rate due to its output-stationary dataflow. However, the tradeoff is the higher off-chip traffic of psums. Gamma-SNN has a higher SRAM miss rate than GoSPA-SNN and LoAS. The reason is that the extra $t$-dim enlarges the partial row traffic by $t$ times. Some of the extra traffic cannot be held in the on-chip SRAM, leading to cache evictions. Overall, the cache miss rate results align with the off-chip traffic trends. Since we set all the baselines to have the same global cache size, the reduction in memory traffic is reflected in LoAS’s improvement in both speedup and energy efficiency.

TABLE IV: Area and Power breakdown of LoAS (left) and one TPPE (right).

| Components | Area (mm²) | Power (mW) |
|---|---|---|
| 16 TPPEs | 0.96 | 45.1 |
| 16 PLIFs | 0.02 | 1.2 |
| Global cache | 0.80 | 124.5 |
| Others | 0.30 | 18.1 |
| Total | 2.08 | 188.9 |

| TPPE units | Area | Power |
|---|---|---|
| Accumulators | 2e-3 | 0.16 |
| Fast Prefix | 0.04 | 1.46 |
| Laggy Prefix | 5e-3 | 0.32 |
| Others | 0.01 | 0.88 |
| TPPE total | 0.06 | 2.82 |

Area and Power: Table IV shows the area and power breakdown of LoAS with the configuration in Table III. Inside each TPPE, a single fast prefix-sum circuit dominates both the area (66.7%) and power (51.8%). The original SparTen [15] even requires two fast prefix-sum circuits, one each for inputs and weights.⁹ Thanks to the laggy prefix-sum circuits (8.3% of area and 11.4% of power) we proposed, LoAS requires only one fast prefix-sum circuit inside each TPPE. At the system level, the global SRAM cache dominates both the power and area, which aligns with previous works [31, 34, 62]. Figure 15 provides a visualization of the power breakdown.

⁹ This is not the case in SparTen-SNN. Since the input spikes are bitmasks and data at the same time, SparTen-SNN requires only one fast prefix-sum circuit.

VI-B Ablation Studies

Temporal Scalability Studies: In our experimental settings, we configured the TPPE inside LoAS to run the SNNs with 4 timesteps. Most state-of-the-art SNN algorithms [10, 12] use 8 or fewer timesteps. We therefore want to understand how the TPPE scales with the number of timesteps. Figure 16(a) shows that the TPPE scales well with the timesteps. The reason is that all TPPE components other than the accumulators and the input data buffer are agnostic to the number of timesteps. Even at 16 timesteps, the TPPE only increases its area (power) by $1.37\times$ ($1.25\times$) compared to 4 timesteps. We also showcase how the ratio of silent neurons in VGG16 scales with the number of timesteps. Figure 16(b) shows that with the help of the preprocessing technique, even at 8 timesteps, we can still have a ratio of silent neurons similar to that at 4 timesteps. However, it is very likely that there are fewer silent neurons at even larger timesteps ($>8$). This is one of the challenges that LoAS needs to face when scaling up the number of timesteps.

Scalability Study: Figure 17 further shows how the overall performance of LoAS scales with different quantities. We first test LoAS running on VGG16 with average sparsity of B (weight) at $98.2\%$ (High), $68.4\%$ (Medium), and $25\%$ (Low). The result shows that LoAS’s performance is highly sensitive to the sparsity level of B. When we scale the sparsity from $98.2\%$ to $25\%$, the performance drops by roughly $88\%$. We also find that LoAS’s performance scales well with timesteps: LoAS loses only roughly $14\%$ of performance when the number of timesteps increases by $2\times$. Finally, we test LoAS’s scalability on layer size. We compare one layer from VGG16 and the hidden feed-forward (HFF) layer from SpikeTransformer [58]. The results show that LoAS scales well, even on the layer with a larger parameter size.

Dual-sparse SNN vs. Dual-sparse ANN: In this work, we focus on providing insights for the community on how spMspM acceleration works on SNNs. However, it is unavoidable to discuss the comparison between SNNs and ANNs. In Figure 18, we show the comparison of normalized energy efficiency and memory traffic between SNNs (LoAS) and ANNs (SparTen [15] and Gamma [62]) running the VGG16 workload. We use the VGG16 workload in Table II for LoAS. The ANN version of VGG16 has 8-bit weights ($98.2\%$ sparsity) and activations ($43.9\%$ sparsity). Overall, the SNN running on LoAS has roughly $2.5\times$ and $1.2\times$ the energy efficiency of the ANNs running on SparTen and Gamma, respectively. We observe that around $60\%$ of the energy is spent on data movement for both networks. We therefore also include the DRAM and SRAM traffic comparison in Figure 18. It shows that SNNs, on average, have $\sim 60\%$ less memory traffic compared to SparTen-ANN. The lower memory traffic comes from the smaller input bitwidth (4-bit vs. 8-bit) and higher input sparsity ($79.6\%$ vs. $43.9\%$), thanks to SNNs’ features of unary activation and sparse spike activity (Section II-B). Not surprisingly, Gamma-ANN has lower overall DRAM accesses compared to LoAS due to its Gust dataflow [62]. The tradeoff is $3.5\times$ more SRAM traffic, which explains why LoAS has a slightly higher overall energy efficiency.

Dual-sparse SNN vs. Dense SNN: To show the benefits of dual-sparsity in SNNs, we compare LoAS with the prior dense SNN systolic-array accelerators, PTB [29] and Stellar [33], running dense VGG16 with 4 timesteps. For a fair comparison, we set the array size for PTB to be $16\times 4$, which generates 16 full-sum outputs for 4 timesteps in parallel (same as LoAS). We further configure Stellar to the same array size. We leverage ScaleSim [44] to estimate both baselines’ memory traffic and cycle counts. We show the comparison in Figure 19. We first observe that LoAS has roughly $6\times$ higher energy efficiency compared to PTB, mainly resulting from the $3\times$ ($12.5\times$) less DRAM (SRAM) traffic. Compared to Stellar, LoAS has roughly $2.5\times$ higher energy efficiency, as well as $2.7\times$ ($6.6\times$) less DRAM (SRAM) traffic. We also observe that LoAS has a $46.9\times$ speedup over PTB. This is primarily due to the data sparsity and the difference between PTB’s partially temporal-parallel mechanism (Section II-E) and LoAS’s fully temporal-parallel mechanism. We observe that Stellar outperforms PTB across all metrics. This is mainly due to Stellar’s optimized spatiotemporal row-stationary dataflow and its spike-skipping technique. However, compared to Stellar, we still achieve roughly $7.1\times$ speedup due to LoAS’s capability to leverage dual-sparsity. Please note that we do not compare with SpinalFlow [36] because its temporal encoding achieves limited accuracy on challenging learning tasks [29, 6].

VII Related Work

Beyond the prior dense SNN accelerator works we discussed in Section II-E, there also exist prior works that try to leverage the sparsity in SNNs. In [3], a neuron filter unit is leveraged to fetch a weight only if there is a 1-spike. However, dual-sparsity (both spike and weight sparsity) is not considered. In [2], the dual-sparsity of SNNs is considered to skip the unmatched computation. However, the weights and spikes are fetched from the off-chip memory in a dense format without any compression, thus failing to save data movement costs. In this work, LoAS leverages the dual-sparsity in SNNs for both computation and data movement.

As we discussed, PTB processes the timesteps in a partially parallel manner. Even if one re-configures the PTB to run all timesteps in parallel (time-window=1), it still differs from LoAS in the loop ordering. In PTB’s loop ordering, $t$ -dim is placed between $m$ -dim and $n$ -dim, while LoAS places the $t$ -dim in the inner-most loop. As discussed in Section III, LoAS’s loop ordering brings more efficiency in spMspM operation. Moreover, PTB targets accelerating workloads with time-series data from DVS sensors [30], where the timestep is usually large ( $>100$ ). On our workloads, where the timesteps are small ( $<8$ ), PTB experiences low hardware utilization. In [32], processing timesteps in parallel is also studied. However, they target the temporal-coded SNN workloads, and the loop ordering is not discussed. Finally, as discussed in Section II-E, Stellar [33] is another work that also tries to process timesteps in parallel. However, it targets the non-LIF, FS-coded SNNs and does not support the dual-sparsity.

VIII Conclusion

In this work, we observe that naively running dual-sparse SNNs on existing spMspM accelerators exhibits sub-optimal efficiency due to the latency and memory traffic penalty brought by processing timesteps. To improve the efficiency, we propose a fully temporal-parallel (FTP) dataflow, which avoids the above problems. To maximize the benefits of FTP, we propose FTP-friendly spike compression and inner-join mechanisms. We also build LoAS, a novel architecture that exemplifies the FTP dataflow. With the help of both FTP-friendly compression and inner-join, LoAS demonstrates significant speedup (up to $8.51\times$) and energy reduction (up to $3.68\times$) compared to prior dual-sparse accelerator baselines.

References

[1] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam et al., “Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip,” IEEE transactions on computer-aided design of integrated circuits and systems, vol. 34, no. 10, pp. 1537–1557, 2015.
[2] Q. Chen, C. Gao, and Y. Fu, “Cerebron: a reconfigurable architecture for spatiotemporal sparse spiking neural networks,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 10, pp. 1425–1437, 2022.
[3] Q. Chen, G. He, X. Wang, J. Xu, S. Shen, H. Chen, Y. Fu, and L. Li, “A 67.5 $\mu$ j/prediction accelerator for spiking neural networks in image segmentation,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 2, pp. 574–578, 2021.
[4] Y. Chen, Z. Yu, W. Fang, T. Huang, and Y. Tian, “Pruning of deep spiking neural networks through gradient rewiring,” arXiv preprint arXiv:2105.04916, 2021.
[5] D. V. Christensen et al., “2022 roadmap on neuromorphic computing and engineering,” Neuromorphic Computing and Engineering, vol. 2, no. 2, p. 022501, 2022.
[6] I. M. Comsa, K. Potempa, L. Versari, T. Fischbacher, A. Gesmundo, and J. Alakuijala, “Temporal coding in spiking neural networks with alpha synaptic function,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 8529–8533.
[7] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain et al., “Loihi: A neuromorphic manycore processor with on-chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.
[8] P. Dayan and L. F. Abbott, Theoretical neuroscience: computational and mathematical modeling of neural systems. MIT press, 2005.
[9] C. Deng, Y. Sui, S. Liao, X. Qian, and B. Yuan, “Gospa: an energy-efficient high-performance globally optimized sparse convolutional neural network accelerator,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 1110–1123.
[10] S. Deng, Y. Li, S. Zhang, and S. Gu, “Temporal efficient training of spiking neural network via gradient re-weighting,” arXiv preprint arXiv:2202.11946, 2022.
[11] W. Fang, Y. Chen, J. Ding, Z. Yu, T. Masquelier, D. Chen, L. Huang, H. Zhou, G. Li, and Y. Tian, “Spikingjelly: An open-source machine learning infrastructure platform for spike-based intelligence,” Science Advances, vol. 9, no. 40, p. eadi1480, 2023.
[12] W. Fang, Z. Yu, Y. Chen, T. Huang, T. Masquelier, and Y. Tian, “Deep residual learning in spiking neural networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 21 056–21 069, 2021.
[13] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” arXiv preprint arXiv:1803.03635, 2018.
[14] S. B. Furber, F. Galluppi, S. Temple, and L. A. Plana, “The spinnaker project,” Proceedings of the IEEE, vol. 102, no. 5, pp. 652–665, 2014.
[15] A. Gondimalla, N. Chesnut, M. Thottethodi, and T. Vijaykumar, “Sparten: A sparse tensor accelerator for convolutional neural networks,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 151–165.
[16] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: Efficient inference engine on compressed deep neural network,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 243–254, 2016.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[18] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, “Extensor: An accelerator for sparse tensor algebra,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 319–333.
[19] K. Hegde, J. Yu, R. Agrawal, M. Yan, M. Pellauer, and C. Fletcher, “Ucnn: Exploiting computational reuse in deep neural networks via weight repetition,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 674–687.
[20] S. Kim, S. Park, B. Na, and S. Yoon, “Spiking-yolo: spiking neural network for energy-efficient object detection,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11 270–11 277.
[21] Y. Kim, J. Chough, and P. Panda, “Beyond classification: Directly training spiking neural networks for semantic segmentation,” Neuromorphic Computing and Engineering, vol. 2, no. 4, p. 044015, 2022.
[22] Y. Kim et al., “Exploring lottery ticket hypothesis in spiking neural networks,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII. Springer, 2022, pp. 102–120.
[23] Y. Kim, Y. Li, H. Park, Y. Venkatesha, R. Yin, and P. Panda, “Exploring lottery ticket hypothesis in spiking neural networks,” in European Conference on Computer Vision. Springer, 2022, pp. 102–120.
[24] Y. Kim and P. Panda, “Revisiting batch normalization for training low-latency deep spiking neural networks from scratch,” Frontiers in neuroscience, 2021.
[25] Y. Kim, H. Park, A. Moitra, A. Bhattacharjee, Y. Venkatesha, and P. Panda, “Rate coding or direct coding: Which one is better for accurate, robust, and energy-efficient spiking neural networks?” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 71–75.
[26] A. Krizhevsky et al., “Imagenet classification with deep convolutional neural networks,” NeurIPS, 2012.
[27] H.-T. Kung, “Why systolic architectures?” Computer, vol. 15, no. 1, pp. 37–46, 1982.
[28] C. Lee, A. K. Kosta, A. Z. Zhu, K. Chaney, K. Daniilidis, and K. Roy, “Spike-flownet: event-based optical flow estimation with energy-efficient hybrid neural networks,” in European Conference on Computer Vision. Springer, 2020, pp. 366–382.
[29] J.-J. Lee, W. Zhang, and P. Li, “Parallel time batching: Systolic-array acceleration of sparse spiking neural computation,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 317–330.
[30] H. Li et al., “Cifar10-dvs: an event-stream dataset for object classification,” Frontiers in neuroscience, vol. 11, p. 309, 2017.
[31] Z. Li, J. Li, T. Chen, D. Niu, H. Zheng, Y. Xie, and M. Gao, “Spada: Accelerating sparse matrix multiplication with adaptive dataflow,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 747–761.
[32] F. Liu, W. Zhao, Z. Wang, Y. Chen, T. Yang, Z. He, X. Yang, and L. Jiang, “Sato: spiking neural network acceleration via temporal-oriented dataflow and architecture,” in Proceedings of the 59th ACM/IEEE Design Automation Conference, 2022, pp. 1105–1110.
[33] R. Mao, L. Tang, X. Yuan, Y. Liu, and J. Zhou, “Stellar: Energy-efficient and low-latency snn algorithm and hardware co-design with spatiotemporal computation,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2024, pp. 172–185.
[34] F. Muñoz-Martínez, R. Garg, M. Pellauer, J. L. Abellán, M. E. Acacio, and T. Krishna, “Flexagon: A multi-dataflow sparse-sparse matrix multiplication accelerator for efficient dnn processing,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 252–265.
[35] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “Cacti 6.0: A tool to model large caches,” HP laboratories, 2009.
[36] S. Narayanan, K. Taht, R. Balasubramonian, E. Giacomin, and P.-E. Gaillardon, “Spinalflow: An architecture and dataflow tailored for spiking neural networks,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 349–362.
[37] E. O. Neftci, H. Mostafa, and F. Zenke, “Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks,” IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 51–63, 2019.
[38] E. O. Neftci, B. U. Pedroni, S. Joshi, M. Al-Shedivat, and G. Cauwenberghs, “Stochastic synapses enable efficient brain-inspired learning machines,” Frontiers in neuroscience, vol. 10, p. 241, 2016.
[39] S. Pal, J. Beaumont, D.-H. Park, A. Amarnath, S. Feng, C. Chakrabarti, H.-S. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, “Outerspace: An outer product based sparse matrix multiplication accelerator,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 724–736.
[40] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse convolutional neural networks,” ACM SIGARCH computer architecture news, vol. 45, no. 2, pp. 27–40, 2017.
[41] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse convolutional neural networks,” ACM SIGARCH computer architecture news, vol. 45, no. 2, pp. 27–40, 2017.
[42] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, “Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 58–70.
[43] K. Roy, A. Jaiswal, and P. Panda, “Towards spike-based machine intelligence with neuromorphic computing,” Nature, 2019.
[44] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, “A systematic methodology for characterizing scalability of dnn accelerators using scale-sim,” in 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2020, pp. 58–68.
[45] A. Sarma, S. Singh, H. Jiang, A. Pattnaik, A. K. Mishra, V. Narayanan, M. T. Kandemir, and C. R. Das, “Exploiting activation based gradient output sparsity to accelerate backpropagation in cnns,” arXiv preprint arXiv:2109.07710, 2021.
[46] A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy, “Going deeper in spiking neural networks: Vgg and residual architectures,” Frontiers in neuroscience, vol. 13, p. 95, 2019.
[47] K. Sewell, R. G. Dreslinski, T. Manville, S. Satpathy, N. Pinckney, G. Blake, M. Cieslak, R. Das, T. F. Wenisch, D. Sylvester et al., “Swizzle-switch networks for many-core systems,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 2, no. 2, pp. 278–294, 2012.
[48] L. Shi, J. Pei, N. Deng, D. Wang, L. Deng, Y. Wang, Y. Zhang, F. Chen, M. Zhao, S. Song et al., “Development of a neuromorphic computing system,” in 2015 IEEE international electron devices meeting (IEDM). IEEE, 2015, pp. 4–3.
[49] Y. Shi, L. Nguyen, S. Oh, X. Liu, and D. Kuzum, “A soft-pruning method applied during training of spiking neural networks for in-memory computing applications,” Frontiers in neuroscience, vol. 13, p. 405, 2019.
[50] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
[51] N. Srivastava, H. Jin, J. Liu, D. Albonesi, and Z. Zhang, “Matraptor: A sparse-sparse matrix multiplication accelerator based on row-wise product,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 766–780.
[52] C. Stöckl and W. Maass, “Optimized spiking neurons can classify images with high accuracy through temporal coding with two spikes,” Nature Machine Intelligence, vol. 3, no. 3, pp. 230–238, 2021.
[53] P. J. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[54] D. Wu, J. Li, R. Yin, H. Hsiao, Y. Kim, and J. San Miguel, “Ugemm: Unary computing architecture for gemm applications,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 377–390.
[55] Y. N. Wu, P.-A. Tsai, A. Parashar, V. Sze, and J. S. Emer, “Sparseloop: An analytical approach to sparse tensor accelerator modeling,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2022, pp. 1377–1395.
[56] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi, “Spatio-temporal backpropagation for training high-performance spiking neural networks,” Frontiers in neuroscience, vol. 12, p. 331, 2018.
[57] Y. Wu, L. Deng, G. Li, J. Zhu, Y. Xie, and L. Shi, “Direct training for spiking neural networks: Faster, larger, better,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 1311–1318.
[58] M. Yao, J. Hu, Z. Zhou, L. Yuan, Y. Tian, B. Xu, and G. Li, “Spike-driven transformer,” Advances in neural information processing systems, vol. 36, 2024.
[59] R. Yin et al., “Sata: Sparsity-aware training accelerator for spiking neural networks,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 6, pp. 1926–1938, 2022.
[60] R. Yin et al., “Workload-balanced pruning for sparse spiking neural networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2024.
[61] R. Yin, Y. Li, A. Moitra, and P. Panda, “Mint: Multiplier-less integer quantization for energy efficient spiking neural networks,” in 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2024, pp. 830–835.
[62] G. Zhang, N. Attaluri, J. S. Emer, and D. Sanchez, “Gamma: Leveraging gustavson’s algorithm to accelerate sparse matrix multiplication,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021, pp. 687–701.
[63] W. Zhang and P. Li, “Temporal spike sequence learning via backpropagation for deep spiking neural networks,” NeurIPS, 2020.
[64] Z. Zhang, H. Wang, S. Han, and W. J. Dally, “Sparch: Efficient architecture for sparse matrix multiplication,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 261–274.
[65] H. Zheng et al., “Going deeper with directly-trained larger spiking neural networks,” in AAAI, 2021.